Big Data Text Analysis
Mozdeh can calculate the average number of retweets/likes/citations (depending on your data) for the texts that match any query. To do this, select Calculate geometric mean retweet counts and confidence intervals from the Analyse menu and then run the search as normal.
The average and confidence intervals will appear to the right of the search results, as below.
Geometric mean retweets/ratings/scores 0.8565 (0.7649, 0.9528)
This means that the geometric mean number of retweets is 0.8565, with a 95% confidence interval of (0.7649, 0.9528).
This can be used to compare individual words, genders or sentiments for their retweet/rating/citation count power. For example, a set of UKIP tweets had an average retweet count of 0.8565 with a 95% confidence interval of (0.7649, 0.9528) for males and 1.8966 (1.5390, 2.3046) for females, giving statistical evidence (because the confidence intervals do not overlap) that there is a gender difference in retweeting.
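The overlap check used in this comparison can be sketched in a few lines of Python. This is only an illustration using the figures quoted above; the function name intervals_overlap is hypothetical and is not part of Mozdeh.

```python
def intervals_overlap(a, b):
    """Return True if the closed intervals a = (low, high) and b = (low, high) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


# 95% confidence intervals quoted above for the UKIP tweets.
male_ci = (0.7649, 0.9528)
female_ci = (1.5390, 2.3046)

if intervals_overlap(male_ci, female_ci):
    print("Intervals overlap: no clear evidence of a difference from this check.")
else:
    print("Intervals do not overlap: evidence of a difference.")
```

Non-overlapping intervals are a conservative test: two groups can differ significantly even when their intervals overlap slightly, so overlap on its own does not show that there is no difference.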
Highly retweeted words can be found in two stages. First, choose a retweet count above zero, say 10, and set this as the minimum retweets.
Next, click the Mine Associations for Search and Filters button. This will list words that occur more often in tweets that have been retweeted at least 10 times. These are the candidates for highly retweeted words. For example, in the UKIP dataset, one of these terms is #paulforstoke. To find the average retweet count for one of these terms, reset the minimum number of retweets to 0, enter the term as a query and click the search button. For #paulforstoke this gives 34.3182 with a 95% confidence interval of (13.4156, 85.5295) for the 17 matches. Thus, on average (geometric mean), tweets containing #paulforstoke got about 34 retweets.
The geometric mean is used instead of the arithmetic mean because the counts are likely to be highly skewed, and the geometric mean is less affected by a few very large values, so it is a better measure of the average for this type of data.
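The effect of skew can be seen in a small made-up example. The counts below are invented for illustration, and adding 1 before taking logs (then subtracting 1 afterwards) is an assumption made here so that tweets with zero retweets can be included; it is not confirmed as the exact formula Mozdeh uses.

```python
import math
import statistics

# Invented retweet counts: mostly small, with one very large value.
counts = [0, 0, 1, 1, 3, 500]

arithmetic_mean = statistics.fmean(counts)

# Assumed offset of 1: average log(1 + count), then transform back.
mean_log = statistics.fmean(math.log(1 + c) for c in counts)
geometric_mean = math.exp(mean_log) - 1

print(f"Arithmetic mean: {arithmetic_mean:.2f}")  # about 84.17
print(f"Geometric mean:  {geometric_mean:.2f}")   # about 3.47
```

The single tweet with 500 retweets pulls the arithmetic mean far above what a typical tweet in the set received, whereas the geometric mean stays close to the bulk of the counts.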
Confidence intervals are calculated using the t-distribution formula on the log-transformed counts.
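A minimal Python sketch of this kind of calculation is given below. It is not Mozdeh's own code. It assumes that 1 is added to each count before taking logs (so that zeros can be handled) and subtracted again after transforming back, and it expects the caller to supply the t critical value from a table; the function name geometric_mean_ci and both parameters are hypothetical.

```python
import math
import statistics


def geometric_mean_ci(counts, t_crit, offset=1.0):
    """Geometric mean and confidence interval from log-transformed counts.

    counts : list of non-negative counts (at least two values).
    t_crit : two-sided t critical value for len(counts) - 1 degrees of
             freedom, taken from a t table (e.g. 2.776 for n = 5 at 95%).
    offset : assumed constant added before logging so zeros are allowed.
    """
    logs = [math.log(c + offset) for c in counts]
    n = len(logs)
    mean_log = statistics.fmean(logs)
    # Standard error of the mean of the logged values.
    std_err = statistics.stdev(logs) / math.sqrt(n)
    low_log = mean_log - t_crit * std_err
    high_log = mean_log + t_crit * std_err

    def back(value):
        # Reverse the log(count + offset) transformation.
        return math.exp(value) - offset

    return back(mean_log), (back(low_log), back(high_log))


# Invented counts chosen so that the geometric mean is 3 by construction.
gm, (low, high) = geometric_mean_ci([0, 1, 3, 7, 15], t_crit=2.776)
print(f"Geometric mean {gm:.4f} ({low:.4f}, {high:.4f})")
```

Because the interval is built on the log scale and then transformed back, it is not symmetric around the geometric mean, which matches the shape of the intervals reported above, such as 34.3182 (13.4156, 85.5295).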