Mozdeh

Big Data Text Analysis


YouTube Comment Term Frequency Comparison (CTFC) method

This is a method to investigate a broad topic (e.g., dancing) with many subtopics (e.g., dance styles) by analysing the words used in comments on relevant videos.

The CTFC method

The CTFC method comprises both data gathering and analysis. Whilst the complete method involves many different types of analyses, a particular application can ignore irrelevant ones. This method is fully described in an academic paper that is currently being reviewed. This page supports this paper with specific instructions for the software.

Note that Steps 1.4 and 1.6 can now be conducted in Mozdeh instead of Webometric Analyst, using the YouTube data collection tab in the Mozdeh startup wizard.

CTFC Step 1: Data gathering and filtering

If you are studying one or more YouTube channels then steps 1-5 can be ignored. Instead, enter the channel ID(s) in the YouTube query interface and select the Channel IDs option in the Data Collection dialog box (Step 6).

  1. Topic definition and delineation: Not automated.
  2. Initial subtopic query set generation: Not automated.
  3. Subtopic query testing and refinement: Not automated.
  4. Video list generation: (a) Enter the subtopic queries into a plain text file, one per line. (b) Download Webometric Analyst from http://lexiurl.wlv.ac.uk to a Windows computer. (c) Sign up for a YouTube API key, which gives permission to submit automatic YouTube queries: http://lexiurl.wlv.ac.uk/searcher/YouTubeKeyRegister.html. (d) Start Webometric Analyst, close the Wizard, and select the YouTube tab in the main interface. (e) Click Search for Videos Matching Each Query in File, select the plain text file of subtopic queries and wait for it to finish. Delete all of the files produced except the one of title matches (i.e., where the queries match the title of the videos).
  5. Video list checking: Not automated.
  6. Comment downloading: (a) Start Webometric Analyst, close the Wizard, and select the YouTube tab in the main interface. (b) Click Get YouTube Comments for List of Video IDs, select the filtered file of query title matches and wait for YouTube to deliver all the comments. This may take days. Note: this step can now be done in Mozdeh instead of Webometric Analyst.
  7. Duplicate commenter removal: (a) Download Mozdeh from http://mozdeh.wlv.ac.uk and start it to get as far as the New Project screen before closing it. This configures folders on your computer. (b) Start Webometric Analyst, close the Wizard, and select the YouTube tab in the main interface. (c) Click Convert YouTube Comments to Mozdeh Format, select the filtered file of query title matches (filename ending in TM.txt or TM), and select the option to process a maximum of one comment per user. This exports the comments to a Mozdeh project.
  8. Comment pre-processing: Start Mozdeh, select the new project of YouTube comments created by Webometric Analyst and wait for Mozdeh to ingest the comments ready for analysis.
  9. Language filtering (English): YouTube comments can be written in any language but the word frequency analysis relies upon the terms being in the same language. The next stage is therefore to filter out comments that may not be in the chosen language. A simple way to achieve this approximately is to exclude all comments that do not contain any terms that are common in the selected language and rare in others (Grefenstette, 1995). For English, the following list matches this specification: as; had; he; his; it; that; the; to; was; with; you; are; my; their; this; thank; for; she; who; why; where; when; how; love; hate; awesome; great; amazing; more. This list was created by comparing the terms that were most frequent in common YouTube languages and keeping the terms that occurred in the English list but not in the other lists. Words that were commonly used in other languages or borrowed within bilingual sentences were excluded (e.g., we, like). (a) Start Mozdeh and open the YouTube comment project. (b) Select the Mozdeh Save tab and click Make New Project from Search Matches. (c) Enter the following query in the search box (without the square brackets): [as had he his it that the to was with you are my their this thank for she who why where when how love hate awesome great amazing more]. (d) Click Boolean Search to run the search and create a second project containing the matching comments. The second project should now be used instead of the first.
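The language filter in Step 9 can be illustrated outside Mozdeh with a minimal Python sketch. It keeps a comment if it contains at least one of the English marker terms listed above. The function names, the simple tokenisation and the sample comments are assumptions made for illustration; Mozdeh's own search may split and match text differently.

```python
import re

# Marker terms copied from the Step 9 list: common in English, rare in other languages.
ENGLISH_MARKERS = frozenset(
    "as had he his it that the to was with you are my their this thank for "
    "she who why where when how love hate awesome great amazing more".split()
)


def tokenize(text):
    """Lower-case a comment and split it into word tokens (an assumed tokenisation)."""
    return re.findall(r"\w+", text.lower())


def looks_english(comment):
    """Return True if the comment contains at least one marker term."""
    return any(token in ENGLISH_MARKERS for token in tokenize(comment))


# Hypothetical sample comments, for illustration only.
comments = [
    "This dance is amazing",
    "Que baile tan bonito",
    "Sehr gut gemacht",
    "I LOVE salsa!",
]
english_comments = [c for c in comments if looks_english(c)]
print(english_comments)
```

As in the Mozdeh search, the filter is approximate: a short English comment with none of the marker terms is discarded, and a non-English comment that happens to contain one is kept.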

This completes the creation of a Mozdeh project with predominantly English comments on each video.
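The "maximum of one comment per user" option in Step 7 can be pictured with the following Python sketch. It assumes that each comment is a dictionary with an "author" field and that the first comment encountered from each user is the one kept; both are illustrative assumptions, and Webometric Analyst may choose which comment to keep differently.

```python
def first_comment_per_user(comments):
    """Keep only the first comment seen from each commenter, preserving order."""
    seen_users = set()
    kept = []
    for comment in comments:
        user = comment["author"]  # hypothetical field name
        if user not in seen_users:
            seen_users.add(user)
            kept.append(comment)
    return kept


# Hypothetical sample data, for illustration only.
sample = [
    {"author": "user_a", "text": "Great video"},
    {"author": "user_b", "text": "Love this dance"},
    {"author": "user_a", "text": "Watching it again"},
]
deduplicated = first_comment_per_user(sample)
print([c["text"] for c in deduplicated])
```

Limiting each user to one comment stops prolific commenters from dominating the later word frequency counts.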

CTFC Step 2: Time series graph

This produces a time series graph of all the comments. (a) Start Mozdeh and load the English version of the project. (b) From the Analyse menu, select the Graph Time Series submenu. (c) Enter a blank search and click Create Graph with Boolean Search. (d) To save the graph to a file, click Show Graph Formatting Options, click Print Graph, and select a printer that saves its output to a file (e.g., PDF or Microsoft document format; most computers have one of these). A commercial product produces particularly good results: www.peernet.com/conversion-software/pdf-to-tiff-converter/. [Select Search from the Analyse menu to return to the main screen.]
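For readers who want to check the graph against the raw data, the underlying counts can be approximated with a short Python sketch. It assumes the comment timestamps are available as ISO 8601 strings and groups them by month; the function name, the input format and the monthly granularity are assumptions, not a description of how Mozdeh builds its graph.

```python
from collections import Counter
from datetime import datetime


def comments_per_month(timestamps):
    """Count comments per calendar month from ISO 8601 timestamp strings."""
    counts = Counter(
        datetime.fromisoformat(ts).strftime("%Y-%m") for ts in timestamps
    )
    # Sort by the "YYYY-MM" key so the series is in chronological order.
    return sorted(counts.items())


# Hypothetical timestamps, for illustration only.
timestamps = [
    "2016-01-05T10:00:00",
    "2016-01-20T18:30:00",
    "2016-03-02T09:15:00",
]
series = comments_per_month(timestamps)
print(series)
```

Months with no comments are absent from the output rather than listed with a zero count.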

CTFC Step 3: Subtopic word frequency analysis

This produces a list of terms that associate with each subtopic compared to the others. First, start Mozdeh and load the English version of the project. Then complete the following steps for each subtopic. (a) Select the subtopic query from the topic/label box. (b) Enter a blank search and click Boolean Search. (c) Click the Calculate Word Frequencies for all Search Matches button. (d) Copy the results (in a text box on the right of the screen) to a spreadsheet by (i) right clicking in the word frequencies box, (ii) clicking Select All, (iii) right clicking in the word frequencies box, (iv) clicking Copy and (v) switching to the spreadsheet and pasting the text to it.
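The idea behind this comparison can be sketched in Python as follows. For each term, the sketch calculates the proportion of a subtopic's comments that contain it and the proportion of all other comments that contain it, then ranks terms by the difference. The function names and sample comments are hypothetical, and the plain difference in proportions is an assumption chosen for simplicity; it is not necessarily the statistic that Mozdeh reports.

```python
import re
from collections import Counter


def document_frequencies(comments):
    """Count, for each term, how many comments contain it at least once."""
    counts = Counter()
    for comment in comments:
        counts.update(set(re.findall(r"\w+", comment.lower())))
    return counts


def compare_subtopic(subtopic_comments, other_comments):
    """Rank subtopic terms by (proportion in subtopic) - (proportion elsewhere)."""
    sub = document_frequencies(subtopic_comments)
    oth = document_frequencies(other_comments)
    n_sub, n_oth = len(subtopic_comments), len(other_comments)
    rows = []
    for term, count in sub.items():
        p_sub = count / n_sub
        p_oth = oth[term] / n_oth if n_oth else 0.0
        rows.append((term, p_sub, p_oth, p_sub - p_oth))
    # Largest difference first; ties broken alphabetically.
    return sorted(rows, key=lambda row: (-row[3], row[0]))


# Hypothetical sample data, for illustration only.
salsa = ["love the salsa spins", "salsa spins are great"]
others = ["ballet is great", "love the ballet shoes"]
rows = compare_subtopic(salsa, others)
for term, p_sub, p_oth, diff in rows[:3]:
    print(term, p_sub, p_oth, diff)
```

Counting each term at most once per comment means that a single comment repeating a word many times cannot inflate that word's score.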

CTFC Step 4: Gender analysis

For the whole project:
(a) Start Mozdeh and load the English version of the project. (b) Click the Advanced Search Tab if it is not already visible. Select Female from the User Gender drop-down box. (c) Enter a blank search and click Boolean Search. (d) Click the Calculate Word Frequencies for all Search Matches button. (e) Copy the results (in a text box on the right of the screen) to a spreadsheet by (i) right clicking in the word frequencies box, (ii) clicking Select All, (iii) right clicking in the word frequencies box, (iv) clicking Copy and (v) switching to the spreadsheet and pasting the text to it. (f) Repeat the above from (b) with Male selected in the User Gender drop-down box.

CTFC Step 4a: Subproject compilation

Before running the subproject gender analysis (or the subproject sentiment analysis), a subproject data file must be built in Mozdeh for each subproject. Conduct the following steps for each subproject. (a) From the Subprojects menu, click [Use all data – ignore all subprojects]. (b) Enter a blank search in the search box at the top left-hand corner of the screen. (c) Click the Save tab and check the Make Subproject From Search Matches option. (d) Click Boolean Search and then enter a name for the subproject (without quotes) in the dialog box. This creates a file listing all comments associated with the subproject.

CTFC Step 4b: Subproject gender analysis

For each subproject the process is the same as above except with a subproject selected (new step c):
(a) Start Mozdeh and load the English version of the project. (b) Click the Advanced Search Tab if it is not already visible. Select Female from the User Gender drop-down box. (c) From the Subprojects menu, click Select Subproject. Click on the subproject name (possibly ending in .dat) in the new dialog box. [Until the subproject is changed, all future operations apply only to texts in that subproject.] (d) Enter a blank search and click Boolean Search. (e) Click the Calculate Word Frequencies for all Search Matches button. (f) Copy the results (in a text box on the right of the screen) to a spreadsheet by (i) right clicking in the word frequencies box, (ii) clicking Select All, (iii) right clicking in the word frequencies box, (iv) clicking Copy and (v) switching to the spreadsheet and pasting the text to it. (g) Repeat the above from (b) with Male selected in the User Gender drop-down box.

CTFC Step 5: Sentiment analysis

For the whole project:
(a) Start Mozdeh and load the English version of the project. (b) Click the + button in the sentiment section of the main interface to restrict the results to comments that are at least moderately positive and not moderately negative. (c) Enter a blank search and click Boolean Search. (d) Click the Calculate Word Frequencies for all Search Matches button. (e) Copy the results (in a text box on the right of the screen) to a spreadsheet by (i) right clicking in the word frequencies box, (ii) clicking Select All, (iii) right clicking in the word frequencies box, (iv) clicking Copy and (v) switching to the spreadsheet and pasting the text to it. (f) Repeat the above from (b), except clicking the – button to the right of the + button.
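The restriction applied by the + and – buttons can be pictured with the Python sketch below. It assumes that every comment carries a positive strength and a negative strength, each on a 1 to 5 scale, and that "moderately" means a strength of at least 3. The scale, the threshold, the function names and the mirrored rule for the – button are all assumptions made for illustration; the source states only that the + button keeps texts that are at least moderately positive and not moderately negative.

```python
MODERATE = 3  # hypothetical threshold on an assumed 1-5 strength scale


def is_positive(pos_strength, neg_strength, threshold=MODERATE):
    """At least moderately positive and not moderately negative."""
    return pos_strength >= threshold and neg_strength < threshold


def is_negative(pos_strength, neg_strength, threshold=MODERATE):
    """Assumed mirror image of is_positive, standing in for the minus button."""
    return neg_strength >= threshold and pos_strength < threshold


# Hypothetical scored comments: (text, positive strength, negative strength).
scored = [
    ("I love this routine", 4, 1),
    ("Terrible camera work", 1, 4),
    ("Love the dance, hate the music", 4, 4),
    ("Uploaded in 2016", 1, 1),
]
positive_texts = [text for text, pos, neg in scored if is_positive(pos, neg)]
negative_texts = [text for text, pos, neg in scored if is_negative(pos, neg)]
print(positive_texts)
print(negative_texts)
```

Under these assumptions a comment with strong sentiment in both directions falls into neither set, so the two word frequency lists are drawn from non-overlapping groups of comments.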

CTFC Step 5a: Subproject compilation

Conduct Step 4a unless it has already been done.

CTFC Step 5b: Subproject sentiment analysis

For each subproject the process is the same as above except with a subproject selected (new step c):
(a) Start Mozdeh and load the English version of the project. (b) Click the Advanced Search Tab if it is not already visible. Click the + button in the sentiment section of the main interface to restrict the results to comments that are at least moderately positive and not moderately negative. (c) From the Subprojects menu, click Select Subproject. Click on the subproject name (possibly ending in .dat) in the new dialog box. [Until the subproject is changed, all future operations apply only to texts in that subproject.] (d) Enter a blank search and click Boolean Search. (e) Click the Calculate Word Frequencies for all Search Matches button. (f) Copy the results (in a text box on the right of the screen) to a spreadsheet by (i) right clicking in the word frequencies box, (ii) clicking Select All, (iii) right clicking in the word frequencies box, (iv) clicking Copy and (v) switching to the spreadsheet and pasting the text to it. (g) Repeat the above from (b), except clicking the – button to the right of the + button.

CTFC Step 6: Overview network

This creates networks of subtopics. (a) Start Mozdeh and load the English version of the project. (b) From the Network menu click Make networks of post similarity between labels.
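One way to think about a network of similarity between labels is sketched below in Python. Each label (subtopic) is represented by the term counts of its comments, and each pair of labels is connected with a weight equal to the cosine similarity of those counts. Cosine similarity, the function names and the sample data are assumptions made for illustration; the source does not state which similarity measure Mozdeh uses.

```python
import math
import re
from collections import Counter


def term_counts(comments):
    """Total term counts over all comments carrying one label."""
    counts = Counter()
    for comment in comments:
        counts.update(re.findall(r"\w+", comment.lower()))
    return counts


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def label_network(comments_by_label):
    """Return {(label_x, label_y): similarity} for every unordered pair of labels."""
    vectors = {label: term_counts(cs) for label, cs in comments_by_label.items()}
    labels = sorted(vectors)
    return {
        (x, y): cosine(vectors[x], vectors[y])
        for i, x in enumerate(labels)
        for y in labels[i + 1:]
    }


# Hypothetical sample data, for illustration only.
edges = label_network({
    "salsa": ["love the spins"],
    "tango": ["love the spins"],
    "ballet": ["pointe shoes"],
})
for pair, weight in sorted(edges.items()):
    print(pair, round(weight, 3))
```

Labels whose comments share no terms receive a weight of zero and so would appear unconnected in a drawing of the network.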

Made by the Statistical Cybermetrics Research Group at the University of Wolverhampton during the CREEN and CyberEmotions EU projects.