Big Data Text Analysis
Chinese text does not include spaces between words, so Mozdeh uses a helper program to insert the spaces before the analyses. The helper is a Java program, so your computer must be able to run Java programs to use this feature. Please run several pilot tests first, each on only a few minutes of data, to make sure that it works smoothly on your computer. It is also best to practice using Mozdeh with English first to get used to it, and then switch to Chinese.
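As a quick check before piloting, the following Python sketch tests whether a `java` command can be found on your system and whether a proposed project name contains only ASCII characters (the steps below ask for English-only project names). It is not part of Mozdeh, and the function names are hypothetical ones chosen for illustration.

```python
import shutil
import subprocess


def java_available() -> bool:
    """Return True if a `java` executable is found on the PATH."""
    return shutil.which("java") is not None


def java_version_text() -> str:
    """Return the text printed by `java -version`, or "" if Java cannot be run."""
    if not java_available():
        return ""
    try:
        # `java -version` usually writes to stderr, so capture both streams.
        result = subprocess.run(
            ["java", "-version"], capture_output=True, text=True
        )
    except OSError:
        return ""
    return (result.stderr or result.stdout).strip()


def is_ascii_project_name(name: str) -> bool:
    """Return True if the name is non-empty and uses only ASCII characters."""
    return bool(name) and name.isascii()


if __name__ == "__main__":
    if java_available():
        print("Java found:")
        print(java_version_text())
    else:
        print("Java not found: install Java before collecting Chinese tweets.")

    # "pilot_test" is an example name, not one that Mozdeh requires.
    print(is_ascii_project_name("pilot_test"))
```

If the script reports that Java was not found, the command window opened by Mozdeh would be expected to show the "'java' is not recognized" error described in the steps below.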
These steps should all happen automatically but this information may help if anything goes wrong.
Start Mozdeh and enter a project name (English only please - Chinese characters might crash the text segmentation step).
Enter query terms in Chinese (one per line, any number of lines) and select zh as the language code, then click to start collecting.
When you have finished collecting tweets, click Stop Monitoring.
If you wish, accept the option to filter tweets (this step is optional).
Mozdeh will now download a zip file of Java programs, the Stanford NLP Toolkit for Chinese (thank you to The Stanford Natural Language Processing Group), and save it to the moz_data folder containing your projects. This is a big file and may take a few minutes to download. A new "command" window will then open and attempt to unzip the file to access the program. Follow the instructions if the unzipping fails. Mozdeh will then run the unzipped program on your Chinese tweets. If you do not have Java on your system, this window will show an error message like: 'java' is not recognized as an internal or external command, operable program or batch file. If you see this message, you need to install Java on your computer and start again with a new project.
If you don't see an error message then follow the instructions in the command box (press any key).
Next click OK, make sure that 3 is selected for the language group in the next box, and click OK in the boxes that follow. Processing may take a long time, possibly days if you have collected weeks of data.
Finally, click Search to show your data and notice that there are now spaces between words. The spaces may not always be in the correct places. See the other parts of this website to find out how to analyse this text.
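To illustrate what inserting spaces between words means, and why the spaces may not always be correct, here is a toy Python sketch. It uses greedy longest-match lookup against a tiny hand-made word list. This is not the method used by the Stanford toolkit, and the function name and vocabulary are made up for this example.

```python
def segment(text: str, vocabulary: set, max_len: int = 4) -> str:
    """Insert spaces using greedy forward longest-match against a word list.

    At each position the longest vocabulary word is taken; if no word
    matches, a single character is emitted as its own token.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, falling back to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                words.append(candidate)
                i += size
                break
    return " ".join(words)


if __name__ == "__main__":
    # A hypothetical mini vocabulary; real segmenters use far richer models.
    vocab = {"我", "喜欢", "北京", "大学", "北京大学"}
    print(segment("我喜欢北京大学", vocab))  # 我 喜欢 北京大学

    # An ambiguous case: the intended reading is "研究 生命 起源",
    # but greedy matching grabs the longer word "研究生" first.
    vocab2 = {"研究", "研究生", "生命", "起源"}
    print(segment("研究生命起源", vocab2))  # 研究生 命 起源
```

The second example shows how a segmenter can place a space in the wrong position when a character sequence can be split in more than one way, so it is worth checking a sample of your segmented tweets by eye.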
Made by the Statistical Cybermetrics Research Group at the University of Wolverhampton during the CREEN and CyberEmotions EU projects.