Big Data Text Analysis
Mozdeh has built-in procedures to help you mark and remove spam content from a set of tweets that has already been collected.
Collect the tweets, load them into Mozdeh, allow Mozdeh to index them and then go to the start screen. Run a blank search (i.e., clear the search box) and then click the Spam Filtering tab below the search box.
This gives lots of options for marking and processing spam.
To mark individual tweets as spam, first click the spam marking option Register All Clicked Search Results in Box Above as Spam (marked by arrow 1 below) and then click any spam result in the list box above it. Each result you click is marked with "s" in the results box (as shown by arrow 2). You can click as many results as you like, and your marks are remembered when you navigate to other pages. Browse your data by repeatedly clicking the next button and click each spam item you see to mark it. If you are marking a lot of results as spam, click the Save Spam List button occasionally in case the program crashes. If you accidentally click the wrong item, click the option below arrow 1 and then click the item again to clear it.
If you identify a pattern, for example that all texts containing "@stephenfry" are spam, you can mark them all at once: run a search for the pattern and then click the Mark all current search results as Spam button, as shown below.
It is also possible to mark all search results as not spam. For example, you might want to mark all search results containing the word holiday as spam except those that also contain the word riot. To do this, first mark all search results for holiday as spam, as described above. Then run a query for holiday AND riot (to match all texts containing both words) and click the Mark all current search results as Not Spam button.
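The effect of these two steps can be illustrated with a minimal Python sketch. This is not Mozdeh's own code: the function names, the sample tweets and the whole-word, case-insensitive matching rule are all assumptions made for illustration.

```python
import re

def matches_all(text, words):
    """Return True if the text contains every query word (whole words, case-insensitive)."""
    tokens = set(re.findall(r"\w+", text.lower()))
    return all(word.lower() in tokens for word in words)

def mark(tweets, spam, words, is_spam):
    """Set or clear the spam flag for every tweet that matches all the query words."""
    for index, text in enumerate(tweets):
        if matches_all(text, words):
            if is_spam:
                spam.add(index)
            else:
                spam.discard(index)

# Hypothetical sample data, not taken from a real Mozdeh project.
tweets = [
    "Win a free holiday now",
    "Holiday plans cancelled after the riot",
    "Riot police deployed downtown",
]

spam = set()
mark(tweets, spam, ["holiday"], is_spam=True)            # search: holiday, mark as spam
mark(tweets, spam, ["holiday", "riot"], is_spam=False)   # search: holiday AND riot, mark as not spam
print(sorted(spam))  # [0]
```

The order matters: the broad query is marked first and the narrower exception is cleared afterwards, so only the holiday tweet that does not mention riot stays marked.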
Before marking a list of duplicates as spam, make a list of the duplicates in case they are needed for future analyses. To do this, click the Save tab (1 below), then click the Save frequency list of matches button (2), clear the search text box (3) and run this blank search by clicking the Boolean Search button (4).
If you are using any automated analyses, such as trend detection or the word frequency table, click the Mark duplicate tweets as spam button. This will mark all duplicate texts as spam so that there will never be two identical texts in the results.
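The idea behind duplicate marking can be sketched in Python as follows. This is an illustration only, not Mozdeh's implementation: it assumes that the first copy of each text is kept, that later exact copies are marked as spam, and that texts are compared after trimming surrounding whitespace. The function name and sample tweets are hypothetical.

```python
def mark_duplicates(tweets):
    """Return the indexes of tweets whose text repeats an earlier tweet."""
    seen = set()   # texts encountered so far
    spam = set()   # indexes of later duplicate copies
    for index, text in enumerate(tweets):
        key = text.strip()
        if key in seen:
            spam.add(index)   # an identical text appeared earlier
        else:
            seen.add(key)     # first occurrence is kept
    return spam

# Hypothetical sample data.
tweets = ["Breaking news", "Breaking news", "Something else", "Breaking news"]
duplicate_spam = mark_duplicates(tweets)
print(sorted(duplicate_spam))  # [1, 3]
```

After removing the marked items, each distinct text appears once, so repeated tweets cannot inflate word frequencies or trend counts.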
To work with spam-free texts, it is best to create a new project after excluding all spam. To do this after completing the spam identification, click the Make new Spam free project button, close Mozdeh and then reopen the new project.
Now follow the instructions to analyse the tweets, starting from the third step.
Made by the Statistical Cybermetrics Research Group at the University of Wolverhampton during the CREEN and CyberEmotions EU projects.