Understanding the relative difference metric
Relative Insight’s comparative approach to text analytics surfaces statistically significant differences and similarities between language sets. In doing so, our software directs your attention to the things that actually matter in a body of text.
The relative difference metric is a measure of how much more prevalent a topic, phrase, word, emotion or grammar element is in one body of text than in the others it is compared against. The platform also displays frequency and similarity metrics.
How is relative difference calculated?
For each language set uploaded into the platform, Relative Insight conducts a detailed linguistic analysis. Our natural language processing algorithms ‘read’ the text, identifying topics, grammar, emotions, words and phrases. The frequency of each language element is then determined and normalized by the size of the language set, to enable ‘apples to apples’ comparisons between language sets of different sizes.
Relative difference is calculated by dividing the normalized frequency of a particular language element in one language set by the normalized frequency of the same element in the comparison language set(s).
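The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the platform's actual implementation: the function names, the per-1,000-words normalization and the handling of a zero frequency in the comparison set are all assumptions made for the example.

```python
import math


def normalized_frequency(count, total_words, per=1000):
    """Occurrences per `per` words (the per-1,000 scale is an assumption)."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return count / total_words * per


def relative_difference(count_a, size_a, count_b, size_b):
    """Ratio of the normalized frequency in set A to that in set B."""
    freq_a = normalized_frequency(count_a, size_a)
    freq_b = normalized_frequency(count_b, size_b)
    if freq_b == 0:
        # Assumption: an element absent from the comparison set is treated as
        # infinitely more prevalent, or undefined if absent from both sets.
        return math.inf if freq_a > 0 else math.nan
    return freq_a / freq_b


# 30 occurrences in 10,000 words versus 20 occurrences in 20,000 words
ratio = relative_difference(30, 10_000, 20, 20_000)
print(f"Relative difference: {ratio:.1f}x")
```

Here the normalized frequencies are 3 and 1 per 1,000 words, so the element is three times as prevalent in the first language set, even though the raw counts (30 versus 20) look much closer.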
Where the relative difference calculation reveals a difference, the platform applies an additional layer of statistical analysis to provide confidence that the difference is not surfacing due to chance. Log-likelihood calculations are performed to assess this possibility, and the analysis output viewable within the platform will only display differences that are significant at the 99% confidence level. This means there is at most a 1% chance that a difference was identified where none truly exists.
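To give a feel for what a log-likelihood test involves, here is a minimal sketch using the two-corpus log-likelihood formula that is common in corpus linguistics. It is an assumption that the platform's calculation resembles this; the function names are hypothetical. The threshold of 6.63 is the standard chi-square critical value for one degree of freedom at the 99% level.

```python
import math

# Chi-square critical value for 1 degree of freedom at p < 0.01 (99% level)
CRITICAL_VALUE_99 = 6.63


def log_likelihood(count_a, size_a, count_b, size_b):
    """Two-corpus log-likelihood (G2) for one language element.

    Compares the observed counts with the counts expected if the element
    were equally prevalent in both language sets.
    """
    total_count = count_a + count_b
    total_size = size_a + size_b
    expected_a = size_a * total_count / total_size
    expected_b = size_b * total_count / total_size

    g2 = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


def is_significant(count_a, size_a, count_b, size_b):
    """True if the difference clears the 99% threshold."""
    return log_likelihood(count_a, size_a, count_b, size_b) >= CRITICAL_VALUE_99


# 30 occurrences in 10,000 words versus 20 occurrences in 20,000 words
print(is_significant(30, 10_000, 20, 20_000))
# Same rate in both sets (10 in 10,000 versus 20 in 20,000)
print(is_significant(10, 10_000, 20, 20_000))
```

In the first comparison the score is well above the threshold, so the difference would be shown. In the second, the element is equally prevalent in both sets, the score is zero, and no difference would be surfaced.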
Why should I trust insights based on low frequencies?
This is one of the most common questions we get from new users of Relative Insight.
The frequency of word usage follows what is called a Zipf distribution. This statistical law states that the frequency of a word is approximately inversely proportional to its rank in the frequency table. Put simply, the second most common word will appear roughly half as often as the most common, the third roughly one third as often, and so on. Because of this, most words are expected to occur very infrequently, and so even a few occurrences can result in a statistically significant finding.
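The shape of a Zipf distribution is easy to see with a small worked example. The sketch below assumes an idealized distribution in which the expected count is exactly the top count divided by the rank; real text only approximates this, and the function name and the figures used are purely illustrative.

```python
def zipf_expected_counts(top_count, ranks):
    """Expected counts for ranks 1..ranks under an idealized Zipf law."""
    return [top_count / rank for rank in range(1, ranks + 1)]


# If the most common word appears 1,200 times, the next few are expected
# to appear about 600, 400 and 300 times.
print(zipf_expected_counts(1200, 4))

# Far down the table the expected count is tiny: rank 1,000 is expected
# to appear only about once.
print(zipf_expected_counts(1200, 1000)[-1])
```

Since the vast majority of words sit in this long tail, a word that is expected to appear once or twice but actually appears several times in one language set can be a meaningful signal.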
The nature of dealing with words
Words are less precise than numbers. This means that even the most advanced text analysis solution may surface findings that don’t make perfect sense. Relative Insight is no exception. For example, consider the word ‘spring’, which can be a verb, the name of a season or a mechanical component, depending on context. This can pose a challenge when it comes to topical classifications. Viewing verbatim examples from the text can help you overcome this and better understand the data you have analyzed when things are not immediately clear.
If you ever need help making sense of something in the analysis, our team will be here to help!