Big Data Text Analysis

Word Association Analyses for time

Word association analysis is introduced here. This page describes two ways to apply it to time periods.

How is time specified in Mozdeh?

Time ranges must be specified as a pair of date numbers in square brackets. Date numbers can be found in the First date to show drop-down box in the Advanced Search tab. In the example below, the fifth date is 8 November 2003 (UK date ordering), so [4 5] indicates all dates from 28/7/2003 to 8/11/2003. Similarly, [1 3] indicates all dates in 2002. Whether the dates in a project are hours, days, months or years depends on the options chosen when first processing the project. To use a different time period (e.g., days instead of months), create a new project and use Mozdeh's Import Data button, importing the same texts but choosing the new period.
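To make the numbering concrete, here is a minimal Python sketch of how a bracketed range maps onto the list of dates. It is an illustration only, not Mozdeh's own code: the function names are hypothetical, the numbering is assumed to start at 1 and include both ends (as the examples above suggest), and the first three entries of the date list are placeholders because only the fourth and fifth dates are given above.

```python
import re

def parse_time_range(query):
    """Parse a bracketed range such as "[4 5]" into (first, last) date numbers."""
    match = re.fullmatch(r"\s*\[\s*(\d+)\s+(\d+)\s*\]\s*", query)
    if match is None:
        raise ValueError(f"not a time range: {query!r}")
    first, last = int(match.group(1)), int(match.group(2))
    if first < 1 or last < first:
        raise ValueError(f"invalid range: {query!r}")
    return first, last

def dates_in_range(query, dates):
    """Return the dates covered by the range (assumed 1-based and inclusive)."""
    first, last = parse_time_range(query)
    return dates[first - 1:last]

# Hypothetical drop-down contents: entries 1-3 are placeholders for the 2002 dates,
# entries 4 and 5 are the dates quoted in the example above.
dates = ["date 1 (2002)", "date 2 (2002)", "date 3 (2002)", "28/7/2003", "8/11/2003"]
print(dates_in_range("[4 5]", dates))
```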

Time range A vs. the rest: Words more common in a given period than in the remaining texts

It can be useful to identify the words that are unusually common in posts that match a particular query/filter because they may point to important aspects of the topic discussed. For time, the analysis splits the data into two parts, time range A and the rest, and finds words that occur in a higher percentage of the texts in A than in the rest of the texts. This is achieved as follows.

  1. Enter the time query (e.g., [15 26]) in the normal top-left search box and clear all filters. Look up the numbers in the First date to show drop-down box, as explained above.
  2. Click Mine associations for search and/or filters (slow).

The results will be displayed below the word associations button.
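The idea behind the A vs. the rest split can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Mozdeh's implementation: posts are assumed to be (date number, text) pairs, words are split on whitespace with no depluralisation, and the function names are hypothetical.

```python
from collections import Counter

def document_frequency(texts):
    """Count, for each word, how many texts contain it at least once."""
    counts = Counter()
    for text in texts:
        counts.update(set(text.lower().split()))
    return counts

def compare_range_to_rest(posts, first, last):
    """posts: list of (date_number, text) pairs.

    Returns {word: (pct_in_A, pct_in_rest)} for words found in a higher
    percentage of texts inside the range than outside it.
    """
    in_a = [text for date, text in posts if first <= date <= last]
    rest = [text for date, text in posts if not first <= date <= last]
    freq_a, freq_rest = document_frequency(in_a), document_frequency(rest)
    result = {}
    for word in freq_a:
        pct_a = 100 * freq_a[word] / len(in_a)
        pct_rest = 100 * freq_rest[word] / len(rest) if rest else 0.0
        if pct_a > pct_rest:
            result[word] = (pct_a, pct_rest)
    return result

# Toy data, invented for illustration only.
posts = [(1, "school closed"), (1, "school cancelled"),
         (2, "school open"), (3, "sunny day")]
print(compare_range_to_rest(posts, 1, 1))
```

Percentages here are of texts containing the word, not of total word occurrences.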

The example below is for the period [1 7], indicating a week of tweets from 10-16 March 2020. At the top of the list, the term coronavirus associates with 10-16 March 2020 because 21.9% of the 10-16 March 2020 tweets contain coronavirus compared to 16.5% of the remaining tweets. The chi-square is a statistical test of this association, and the stars indicate that the difference is statistically significant (unlikely to be due to normal random variations). Perhaps more interesting is the greater focus on things being cancelled during 10-16 March 2020 than later on. Note that even though some of the percentage differences are tiny (e.g., 0.9% vs. 0.4% for toilet) they are still significant. The words are depluralised, so school includes both school and schools, and classe refers to classes.
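For readers who want to see how such a test works, the following Python sketch computes a standard Pearson chi-square for a 2x2 table (word present or absent, in range A or in the rest). It is an assumption that this matches Mozdeh's calculation: any continuity or multiple-testing corrections the program applies are not described here. The tweet counts are hypothetical, chosen only to reproduce the 21.9% and 16.5% figures, and the function name is illustrative.

```python
import math

def chi_square_2x2(hits_a, total_a, hits_rest, total_rest):
    """Pearson chi-square (1 degree of freedom, no continuity correction).

    Returns (chi_square, p_value) for the difference between the share of
    texts containing a word in A and the share in the rest.
    """
    a, b = hits_a, total_a - hits_a            # A: with word, without word
    c, d = hits_rest, total_rest - hits_rest   # rest: with word, without word
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0, 1.0                        # degenerate table: no evidence
    chi2 = n * (a * d - b * c) ** 2 / denominator
    p_value = math.erfc(math.sqrt(chi2 / 2))   # upper tail for 1 degree of freedom
    return chi2, p_value

# Hypothetical counts: 219 of 1000 tweets (21.9%) vs. 330 of 2000 tweets (16.5%).
chi2, p = chi_square_2x2(219, 1000, 330, 2000)
print(f"chi-square = {chi2:.2f}, p = {p:.5f}")

# The same 0.9% vs. 0.4% gap with small and large hypothetical samples.
print(chi_square_2x2(9, 1000, 4, 1000))
print(chi_square_2x2(900, 100000, 400, 100000))
```

The last two calls illustrate why a tiny percentage difference can still be significant: the chi-square value grows with the number of texts, so the same gap that is inconclusive in a small sample becomes significant in a large one.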

Time range A vs. Time range B

To compare two time periods (e.g., 2020 against 2017, ignoring 2018 and 2019):

1) Select the Association mining comparisons tab.

2) Enter the two periods in the new text box, separated by a comma (e.g., [2 7], [9 15]).

3) Click Compare words matching the above queries (slow).

The results will be saved into a set of files (not displayed on screen) that can be loaded into a spreadsheet to be viewed. The queries below compare period [1 20] against period [50 100]. This example goes one step further by only comparing texts in these two periods that contain the term room. In other words, the comparison is between texts containing the term room in the first 20 dates and texts containing room in dates 50 to 100.
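The logic of an A vs. B comparison with a keyword restriction can be sketched as follows. This is a hypothetical Python illustration rather than Mozdeh's code: posts are assumed to be (date number, text) pairs, matching is by whole lower-cased word, and the function names and toy data are invented.

```python
def select(posts, first, last, term=None):
    """Texts dated within [first last] that contain term (if one is given)."""
    chosen = []
    for date_number, text in posts:
        words = text.lower().split()
        if first <= date_number <= last and (term is None or term in words):
            chosen.append(text)
    return chosen

def percent_containing(texts, word):
    """Percentage of texts that contain word as a whole word."""
    if not texts:
        return 0.0
    hits = sum(1 for text in texts if word in text.lower().split())
    return 100 * hits / len(texts)

# Toy data, invented for illustration only.
posts = [(3, "small room no view"), (10, "room was clean"),
         (30, "great breakfast"), (60, "room with a view"),
         (75, "lovely view from the room"), (90, "pool was closed")]

period_a = select(posts, 1, 20, term="room")     # like the query: room [1 20]
period_b = select(posts, 50, 100, term="room")   # like the query: room [50 100]
for word in ("view", "clean"):
    print(word, percent_containing(period_a, word), percent_containing(period_b, word))
```

Texts dated between the two ranges (date 30 here) and texts without the term are ignored entirely, which is what distinguishes this comparison from the A vs. the rest analysis.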

After loading the files into a spreadsheet and formatting them, you might see results like those below. This is for like [1 62] against like [63 67]. It finds terms associating with like in earlier posts (months 1 to 62) against terms associating with it in later posts (months 63 to 67). For example, the term view was in 4.8% of posts from [63 67] but only 1.3% of earlier posts from [1 62].

Made by the University of Wolverhampton during the CREEN and CyberEmotions EU projects and updated at the University of Sheffield.