Mozdeh

Big Data Text Analysis

Word Association Thematic Analysis - Overview and Mozdeh instructions

Word association thematic analysis (WATA) identifies themes that occur more often in one subset of texts than in another. For example, given a set of Covid-19 tweets, it might find that male-oriented themes include sports and news, female-oriented themes include personal safety and mental health, and nonbinary-oriented themes include politics and identity.

How does it work? For any set of texts, you set filters in Mozdeh to split the texts into two non-overlapping sets, based on gender, country, label, sentiment, retweet count, or a query. Mozdeh then finds words occurring more often in one set than in the other. You then read many of the texts containing each word to identify its typical context, and use thematic analysis methods to group the words into themes. Full details are given in the following book.

Thelwall, M. (2021). Word association thematic analysis: A social media text exploration strategy. San Rafael, CA: Morgan & Claypool.
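
The word comparison step described above can be illustrated with a short Python sketch. This is not Mozdeh's code: the tokenisation, the function names and the use of a plain 2x2 chi-square statistic to rank words are all assumptions made for illustration, and no significance threshold or multiple-testing correction is applied.

```python
import re
from collections import Counter


def texts_containing(texts):
    """Count, for each word, how many texts contain it at least once."""
    counts = Counter()
    for text in texts:
        # Assumed tokenisation: lowercase runs of letters, digits, #, @ and '.
        counts.update(set(re.findall(r"[a-z0-9#@']+", text.lower())))
    return counts


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the table [[a, b], [c, d]], no continuity correction."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


def word_associations(set_a, set_b):
    """Rank words that occur in a larger share of set_a texts than set_b texts."""
    if not set_a or not set_b:
        raise ValueError("both sets of texts must be non-empty")
    in_a, in_b = texts_containing(set_a), texts_containing(set_b)
    rows = []
    for word, a in in_a.items():
        c = in_b[word]  # a Counter returns 0 for words it has not seen
        share_a, share_b = a / len(set_a), c / len(set_b)
        if share_a > share_b:
            chi2 = chi_square_2x2(a, len(set_a) - a, c, len(set_b) - c)
            rows.append((word, share_a, share_b, chi2))
    # Strongest associations first.
    return sorted(rows, key=lambda row: row[3], reverse=True)
```

Calling `word_associations(set_a, set_b)` with two lists of strings returns tuples of (word, share of set A texts, share of set B texts, chi-square value); swapping the arguments gives the words that are more common in the other set.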

Examples of publications using WATA

The table below illustrates some WATA studies.

| Topic | Data | Comparison | Example findings |
|---|---|---|---|
| Gender differences in reactions to Covid-19 | Tweets mentioning Covid-19 | Female v. male | Females tweet more about safety, males more about politics (Thelwall & Thelwall, 2020). |
| Personal experiences of ADHD | Tweets about “my ADHD” | ADHD v. other disorders | The brain is discussed as if it is a separate entity (Thelwall et al., 2021a). |
| Evolution of #BlackLivesMatter during Covid-19 | Covid-19 tweets about racism | Tweets in four different periods | The George Floyd killing led to tweets about systematic racism (Thelwall & Thelwall, 2021). |
| Self-presentation on Twitter | UK Twitter profiles | Female v. male v. nonbinary | Nonbinary profiles more likely to mention games and sexuality (Thelwall et al., 2021b). |
| Autism on Twitter | US autism tweets during Covid-19 | Autism v. others | Autistic tweeters do not have distinctive reactions to Covid-19 (Thelwall & Thelwall, submitted). |
| Gender differences in museum interests | Comments on YouTube museum videos | Female v. male | Females are more explicitly positive about content (Thelwall, 2018c). |
| Discussions of bullying in YouTube | Comments on YouTube influencer videos | Bullying v. others | Strategies used to address bullying include generalisation (Thelwall & Cash, to appear). |
| Interests on Reddit | Reddit posts | Female v. male | Females more likely to mention doctors in the science subreddit (Thelwall & Stuart, 2019). |
| Factors associated with success in SteemIt | SteemIt (like Reddit) posts | Successful v. unsuccessful posts | Financial news is less likely to be rewarded (Thelwall, 2018b). |
| Nursing research | Nursing journal articles* | USA v. other countries | Nursing administration and management is not studied in some countries (Thelwall & Mas-Bleda, in press). |
| US research subjects | US journal articles* | Female v. male | Lists of gendered research topics and styles (Thelwall et al., 2019a). |
| UK research subjects | UK journal articles* | Female v. male | Lists of gendered research topics and styles (Thelwall et al., 2020). |
| Indian research subjects | Indian journal articles* | Female v. male | Lists of gendered research topics and styles (Thelwall et al., 2019b). |

References to papers using Word Association Thematic Analysis (not necessarily using that term).

Thelwall, M., Abdoli, M., Lebiedziewicz, A. & Bailey, C. (2020). Gender disparities in UK research publishing: Differences between fields, methods and topics. El Profesional de la Información, 29(4), e290415. https://doi.org/10.3145/epi.2020.jul.15
Thelwall, M., Bailey, C., Makita, M., Sud, P. & Madalli, D. (2019b). Gender and Research Publishing in India: Uniformly high inequality? Journal of Informetrics, 13(1), 118–131.
Thelwall, M., Bailey, C., Tobin, C. & Bradshaw, N. (2019a). Gender differences in research areas, methods and topics: Can people and thing orientations explain the results? Journal of Informetrics, 13(1), 149-169.
Thelwall, M. & Cash, S. (to appear). Bullying discussions in UK female influencers’ YouTube comments. British Journal of Guidance and Counselling. https://doi.org/10.1080/03069885.2021.1901263
Thelwall, M., Makita, M., Mas-Bleda, A. & Stuart, E. (2021a). “My ADHD hellbrain”: A Twitter data science perspective on a behavioural disorder. Journal of Data and Information Science, 6(1). https://doi.org/10.2478/jdis-2021-0007
Thelwall, M. & Mas-Bleda, A. (2018). YouTube science channel video presenters and comments: Female friendly or vestiges of sexism? Aslib Journal of Information Management, 70(1), 28-46.
Thelwall, M. & Mas-Bleda, A. (in press). How does nursing research differ internationally? A bibliometric analysis of six countries. International Journal of Nursing Practice. https://doi.org/10.1111/ijn.12851
Thelwall, M. & Stuart, E. (2019). She’s Reddit: A source of statistically significant gendered interest information? Information Processing & Management, 56(4), 1543-1558.
Thelwall, M. & Thelwall, S. (2020). Covid-19 tweeting in English: Gender differences. El Profesional de la Información, 29(3), e290301.
Thelwall, M. & Thelwall, S. (2021). Racism discussions on Twitter after George Floyd during Covid-19: A space to address systematic and institutionalized racism? Social Science Research Network. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3764867
Thelwall, M., Thelwall, S. & Fairclough, R. (2021b). Male, female and nonbinary differences in UK Twitter user self-descriptions: A fine-grained systematic exploration. Journal of Data and Information Science, 6(2), 1-27.
Thelwall, M. & Thelwall, S. (submitted). Autism Spectrum Disorder on Twitter during Covid-19: Account types, self-descriptions and tweeting themes.
Thelwall, M. (2018a). Social media analytics for YouTube comments: Potential and limitations. International Journal of Social Research Methodology, 21(3), 303-316.
Thelwall, M. (2018b). Can social news websites pay for content and curation? The SteemIt cryptocurrency model. Journal of Information Science, 44(6), 736–751.
Thelwall, M. (2018c). Can museums find male or female audiences online with YouTube? Aslib Journal of Information Management, 70(5), 481-497.

See also (word association analysis, but not full WATA):

Thelwall, M. (2021). World Food Day on Twitter 2009-2020: Driven by UNFAO and aligned campaigns. SSRN

Examples of identifying the context of words and their themes

The following spreadsheets give artificial examples of words identified by Mozdeh, with suggested human-assigned contexts and themes for them. These are small-scale examples for training purposes. Most of the papers above have online supplements with longer lists of words with contexts and themes.

The rest of this page gives instructions for getting the data for a WATA study with Mozdeh.

Instructions for Mozdeh

Step 1: Collect your data (tweets, YouTube comments, other) with Mozdeh in the same way as for any other Mozdeh project. Go to the main Mozdeh search screen when you have finished.

Step 2: Decide what type of comparison you are making. If you are comparing the texts matching a filter against the rest (e.g., female-authored tweets against all other tweets) then follow the version of step 3 for a one-vs-remainder word association test (3a). If you are comparing one set of texts against another, but not the rest (e.g., nonbinary-authored tweets against female-authored tweets) then follow the version of step 3 for an A-vs-B word association test (3b).

Step 3a: Enter filters in the search screen to match your set (e.g., gender, country...), check the Always save mine associations results... option in the Advanced menu and click the Mine Associations for Search and Filters (slow) button. This should produce a file containing a list of words occurring more often in the set matching your filters than in the remaining texts. The list is sorted in descending order of statistical significance, and a star at the end of a row means that the difference is statistically significant. The file is in the folder that will appear when the procedure is finished.

Step 3b: Select the Association mining comparisons tab. Enter two queries in the text box, separated by a comma as in the examples below. The two queries specify the two subsets to be compared (or enter one query and select the Male vs. Female option). Here are some examples.

* The queries nonbinary,transgender are an instruction to compare texts containing "nonbinary" with texts containing "transgender".

* The queries <N>,<F> are an instruction to compare texts authored by nonbinary people with texts authored by females.

* The queries <M>{{UK}},<F>{{UK}} are an instruction to compare texts from the UK authored by males with texts from the UK authored by females.

* The queries our{{UK}},our{{USA}} are an instruction to compare texts from the UK containing the word "our" with texts from the USA containing the word "our".

Now click the Compare words matching the above queries (slow) button. This should produce a file containing a list of words occurring more often in texts matching the first query than in texts matching the second (and vice versa). The list is sorted in descending order of statistical significance, and a star at the end of a row means that the difference is statistically significant. The file is in the folder that will appear when the procedure is finished.
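
To make the query notation in the examples above concrete, here is a hypothetical Python parser for it. It is only a reading aid and not part of Mozdeh: the `Subset` class, the function names and the assumption that a query consists of an optional keyword, an optional `<M>`, `<F>` or `<N>` gender code and an optional `{{country}}` are inferred from the four examples, and Mozdeh may accept other forms.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

# Gender codes as they appear in the example queries.
GENDER_CODES = {"M": "male", "F": "female", "N": "nonbinary"}


@dataclass
class Subset:
    """One side of a comparison: the settings that define a subset of texts."""
    keyword: Optional[str] = None
    gender: Optional[str] = None
    country: Optional[str] = None


def parse_query(query: str) -> Subset:
    """Split one query into keyword, gender and country parts."""
    gender_match = re.search(r"<([MFN])>", query)
    country_match = re.search(r"\{\{(.+?)\}\}", query)
    # Whatever remains after removing the gender and country markers is the keyword.
    keyword = re.sub(r"<[MFN]>|\{\{.+?\}\}", "", query).strip()
    return Subset(
        keyword=keyword or None,
        gender=GENDER_CODES[gender_match.group(1)] if gender_match else None,
        country=country_match.group(1) if country_match else None,
    )


def parse_comparison(line: str) -> Tuple[Subset, Subset]:
    """Parse a 'first,second' pair; raises ValueError unless there is exactly one comma."""
    first, second = line.split(",")
    return parse_query(first), parse_query(second)
```

For instance, `parse_comparison("<M>{{UK}},<F>{{UK}}")` yields one subset with gender male and country UK and a second with gender female and country UK, which are the search screen settings needed for each side in the later steps.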

Step 4: Configure the filters on the search screen for the first of the two queries compared, so that all texts to be classified match that query. In the 3a case, keep the filters and/or queries used in step 3a. For the four 3b examples, the settings would be as follows.

* Enter nonbinary in the search box.

* Select nonbinary in the gender selection box.

* Select male in the gender selection box and UK in the country selection box.

* Enter our in the search box and select UK in the country selection box.

Step 5: Click on the Save tab. Click the WATA button and select the file created by step 3a or 3b (the version ending in diffp or diffp.txt). In reply to the questions, select column 1 and answer Yes to the question about search screen filters (unless you don't need to use them). This will produce a file that can be loaded into a spreadsheet and that contains 100 randomly selected texts containing each of the first 100 words found by the word association tests (unless you changed the recommended answers).
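
The sampling that step 5 describes can be pictured with the following Python sketch. It is an illustration under assumptions rather than Mozdeh's implementation: the function name, its parameters and the tokenisation are hypothetical, and words with fewer matching texts than the sample size simply return all of their matches.

```python
import random
import re


def sample_texts_per_word(texts, words, per_word=100, max_words=100, seed=None):
    """For each of the first max_words words, pick up to per_word random texts containing it."""
    rng = random.Random(seed)  # a seed makes the random sample repeatable
    # Assumed tokenisation: lowercase runs of letters, digits, #, @ and '.
    tokenised = [set(re.findall(r"[a-z0-9#@']+", text.lower())) for text in texts]
    samples = {}
    for word in words[:max_words]:
        matching = [text for text, tokens in zip(texts, tokenised) if word.lower() in tokens]
        samples[word] = rng.sample(matching, min(per_word, len(matching)))
    return samples
```

Here `texts` stands for the texts matching the search screen filters and `words` for the word list from step 3, in its original order.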

Step 6a: If you followed step 3a, skip step 6.

Step 6b: Repeat steps 4 and 5, altering the filters in Step 4 to match those in the second query. For example, in the 3b cases this would be:

* Enter transgender in the search box.

* Select female in the gender selection box.

* Select female in the gender selection box and UK in the country selection box.

* Enter our in the search box and select USA in the country selection box.

After step 5 (or step 6b, if you followed 3b), you don't need Mozdeh any more. Use the file produced by step 5 for the thematic analysis stage, to find themes in the words found by the word association analysis. For example, if one of the top 100 word association words is "racist" then the file will contain 100 texts including the word "racist", and the context in which the word is used can be deduced by reading them. Repeating this for the other words and clustering the word contexts as part of a thematic analysis might put "racist" with an "Abuse" theme or a "Politics" theme, for example.

Important: If you followed 3b/6b then you will have two files, one for each of the two queries. Only classify texts for a word from the file for the query whose texts use the word more often. For example, if the term stupid is used more often by UK females than by UK males for the third pair of queries above, then texts containing stupid should be classified from the second file (UK females) and the texts in the first file (UK males) should be ignored.
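
The bookkeeping for this final stage can be sketched in Python as below. The reading and coding of texts is done by a person, so the code only records the outcome: the function names are hypothetical, and the contexts and themes in the usage note are invented for illustration, not findings from any study.

```python
from collections import defaultdict


def file_to_classify_from(share_first, share_second):
    """Pick the step 5 file to read for a word: the one for the query using the word more often."""
    if share_first == share_second:
        raise ValueError("the word is equally common in both sets, so it is not associated with either")
    return "first" if share_first > share_second else "second"


def group_by_theme(coded_words):
    """Turn {word: (context, theme)} decisions made by a human coder into {theme: [words]}."""
    themes = defaultdict(list)
    for word, (_context, theme) in coded_words.items():
        themes[theme].append(word)
    return dict(themes)
```

For example, `group_by_theme({"racist": ("insult aimed at another user", "Abuse"), "vote": ("urging people to vote", "Politics")})` returns `{"Abuse": ["racist"], "Politics": ["vote"]}`, and `file_to_classify_from(0.01, 0.03)` returns `"second"` because the word is more common in texts matching the second query.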

 

Made by the Statistical Cybermetrics Research Group at the University of Wolverhampton during the CREEN and CyberEmotions EU projects.