Big Data Text Analysis
There are three types of word association analysis that can be applied to texts gathered by Mozdeh. These detect words that associate with a query and/or filters, with one query in comparison to another, or with the entire project.
It can be useful to identify the words that are unusually common in posts that match a particular query/filter because they may point to important aspects of the topic discussed. This can be achieved as follows.
In the example below the query is spring and no filters were selected. At the top of the list, ignoring spring itself, the term roll associates with spring because 23.4% of the texts containing "spring" also contain "roll", whereas only 0.7% of the remaining posts contain "roll". The chi-square value is a statistical test of this association.
The chi-square value indicates how statistically significant the difference is between the frequency of a term in the topic-specific collection of texts and its frequency in the rest of the texts collected, with higher values indicating more significant results.
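The calculation behind these figures can be sketched in a few lines of Python. This is only an illustration, assuming a standard Pearson chi-square on a 2x2 table without continuity correction; the exact formula Mozdeh uses is not stated here, and the function names and sample texts are hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for a 2x2 table (no continuity correction).

    a: matching texts that contain the word
    b: matching texts that do not contain the word
    c: remaining texts that contain the word
    d: remaining texts that do not contain the word
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # an empty row or column: no association can be measured
    return n * (a * d - b * c) ** 2 / denominator


def word_association(texts, query, word):
    """Return (% of matching texts with word, % of remaining texts with word, chi-square)."""
    a = b = c = d = 0
    for text in texts:
        tokens = set(text.lower().split())  # naive whitespace tokenisation (assumption)
        if query in tokens:
            if word in tokens:
                a += 1
            else:
                b += 1
        elif word in tokens:
            c += 1
        else:
            d += 1
    pct_matching = 100 * a / (a + b) if a + b else 0.0
    pct_remaining = 100 * c / (c + d) if c + d else 0.0
    return pct_matching, pct_remaining, chi_square_2x2(a, b, c, d)


# Hypothetical mini-collection, not real Mozdeh data.
sample_texts = [
    "spring roll", "spring roll", "spring day",
    "summer day", "winter roll", "autumn day",
]
pct_matching, pct_remaining, chi = word_association(sample_texts, "spring", "roll")
print(f"{pct_matching:.1f}% vs {pct_remaining:.1f}%, chi-square = {chi:.3f}")
```

A word is ranked highly when the first percentage is much larger than the second and the collection is large enough for the chi-square value to be high.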
The above methods find words that associate with one term and/or filters compared to the rest of the project. You might instead want to compare a set of queries against each other. To do this, select the Association mining comparisons tab and enter your queries in the new text box, separated by commas. Then click Compare words matching the above queries (slow). The results will be saved into a set of files (not displayed onscreen) that need to be loaded into a spreadsheet to be analysed. For example, the queries bus, train, car will find words that associate with bus in comparison to train, with bus in comparison to car, and with train in comparison to car.
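The set of comparisons produced by a comma-separated list of queries can be pictured as every unordered pair of queries. The following sketch shows that pairing only; the function name is hypothetical and it does not reproduce Mozdeh's own processing or output files.

```python
from itertools import combinations


def comparison_pairs(query_box_text):
    """Split a comma-separated query list and return every pair to compare."""
    queries = [q.strip() for q in query_box_text.split(",") if q.strip()]
    return list(combinations(queries, 2))


pairs = comparison_pairs("bus, train, car")
for first, second in pairs:
    print(f"{first} in comparison to {second}")
```

The number of comparisons grows quickly: n queries give n(n-1)/2 pairs, which is one reason the button is labelled as slow.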
After loading into a spreadsheet and formatting, you might see results like those below. This is for the query like [1 62] against the query like [63 67]. It is a comparison of words associating with the term like in earlier posts (months 1 to 62) against words associating with it in later posts (months 63 to 67). Square brackets can be used to delimit dates within queries in this way.
To find words that associate with male or female commenters for a query, enter the query in the box above, check the Compare male vs. female for each query option, and click Compare words matching the above queries (slow).
To find words that associate with recent or old posts, enter the start and end numbers of the dates to be searched in square brackets in the query, as below. As usual, to run the comparisons, click Compare words matching the above queries (slow). In the example below the comparison is between texts containing the term room in the first 20 dates and texts containing room in dates 50 to 100.
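To make the square-bracket notation concrete, here is a minimal sketch of how a query such as room [50 100] could be split into a term and a date-number range. It is not Mozdeh's parser: the function names are hypothetical, and it assumes that both ends of the range are inclusive and that each text already carries its date number.

```python
import re

# A term followed by two integers in square brackets, e.g. "like [1 62]".
_DATE_RANGE = re.compile(
    r"^(?P<term>.*?)\s*\[\s*(?P<start>\d+)\s+(?P<end>\d+)\s*\]\s*$"
)


def parse_query(query):
    """Return (term, start, end); start and end are None without brackets."""
    query = query.strip()
    match = _DATE_RANGE.match(query)
    if match is None:
        return query, None, None
    return match.group("term"), int(match.group("start")), int(match.group("end"))


def matches(text, date_number, query):
    """True if the text contains the term and falls inside the date range."""
    term, start, end = parse_query(query)
    in_range = start is None or start <= date_number <= end  # inclusive (assumption)
    return in_range and term.lower() in text.lower().split()


print(parse_query("room [1 20]"))
print(matches("The room was cold", 10, "room [1 20]"))
```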
Date numbers can be found in the First date to show drop-down box in the Advanced Search tab. In the example below, the fifth date is 8 November 2003 (UK date ordering).
To identify words that associate with your entire project, a generic list of word frequencies is needed that the topic words can be compared against. This should be a plain text file with a list of words and their frequencies from a common, generic collection of texts. A file of word frequencies for a large collection of UK and Ireland tweets will be used unless you have your own. Click the Load Word Freq List Reference Set button on the bottom right of the screen, follow the instructions, clear the search box, and click Mine associations for search and/or filters (slow). The results will appear in the text box below it.
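The idea of comparing a project against a generic reference list can be sketched as follows. This is an illustration under stated assumptions rather than Mozdeh's method: the reference file is assumed to hold one word and its frequency per line separated by a tab, the scoring is a simple ratio of add-one smoothed relative frequencies (not the chi-square statistic), and all names and sample data are hypothetical.

```python
from collections import Counter


def load_reference(lines):
    """Parse 'word<TAB>frequency' lines into a Counter (assumed file format)."""
    reference = Counter()
    for line in lines:
        parts = line.strip().split("\t")
        if len(parts) != 2 or not parts[1].isdigit():
            continue  # skip blank or malformed lines
        reference[parts[0].lower()] += int(parts[1])
    return reference


def project_vs_reference(texts, reference):
    """Rank project words by project rate / reference rate, highest first."""
    project = Counter(token for text in texts for token in text.lower().split())
    project_total = sum(project.values())
    reference_total = sum(reference.values())
    vocabulary = len(set(project) | set(reference))
    scores = {}
    for word, count in project.items():
        # Add-one smoothing avoids dividing by zero for words absent from the reference.
        project_rate = (count + 1) / (project_total + vocabulary)
        reference_rate = (reference[word] + 1) / (reference_total + vocabulary)
        scores[word] = project_rate / reference_rate
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical data; a real reference list would come from a file on disk.
reference = load_reference(["the\t1000", "is\t500", "bus\t5"])
ranked = project_vs_reference(
    ["the bus is late", "the bus is full", "bus bus bus"], reference
)
for word, score in ranked:
    print(f"{word}\t{score:.2f}")
```

Words that are common everywhere, such as "the", fall to the bottom of the ranking, while words that are unusually frequent in the project rise towards the top.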