Big Data Text Analysis
For keen computer scientists, here is a sample of our old raw RSS data (228 MB). The data was collected from 22 December 2004 to 4 February 2005 by monitoring a few thousand RSS feeds (our set of RSS feeds later grew to almost 70,000 and then to about 100,000). Unzip the archive to see the data, which is stored as a set of daily XML-like text files. Each file lists all the new RSS items collected on that day, and each item is marked with a unique item ID and the ID of the feed from which it was collected. Sorry, the data is difficult to use even with the Mozdeh manual, so please only try it if you are really keen.
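As a rough starting point for exploring the daily files, the sketch below counts how many items each feed contributed in one file's text. The tag and attribute names are not documented here, so the `<item id="..." feed="...">` layout, the `ITEM_PATTERN` regular expression and the `count_items_per_feed` helper are all assumptions made for illustration. Open one of the daily files first and adjust the pattern to match what it actually contains. A regular expression is used rather than an XML parser because the files are only XML-like and may not be well formed.

```python
import re
from collections import Counter

# Hypothetical item marker: the real tag and attribute names in the
# daily files may differ, so adjust this pattern after inspecting one.
ITEM_PATTERN = re.compile(
    r'<item\s+id="(?P<item_id>\d+)"\s+feed="(?P<feed_id>\d+)"'
)


def count_items_per_feed(text):
    """Return a Counter mapping feed ID -> number of items found in text."""
    return Counter(m.group("feed_id") for m in ITEM_PATTERN.finditer(text))


# Made-up sample in the assumed layout, not real data from the archive.
sample = """
<item id="1" feed="10">First post</item>
<item id="2" feed="10">Second post</item>
<item id="3" feed="27">Other feed</item>
"""

counts = count_items_per_feed(sample)
print(counts.most_common())
```

To run this over the real data, read each daily file into a string and pass it to `count_items_per_feed`, then add the resulting counters together to get totals for the whole collection period.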
Here is a sample of the vocabulary from some RSS feeds (32 MB zip file covering blogs, news and other sites). The file includes: