
Text mining

Corpora, databases, tools, methods and further reading.

Corpora and databases

Existing materials/corpora

In this LibGuide we refer to text mining corpora and databases:  

  • A corpus is the set of texts that you analyze. In this LibGuide, we have collected a few corpora to give you an idea of the possibilities of text mining for research purposes. 

  • A text database stores textual content that you can use to create corpora for text mining analyses. 

Not all text databases and corpora are suitable for text analysis. Suitability depends on several aspects: copyright, privacy, the type of file* and the quality of text documentation**.


*The type of file must be supported by the text mining tool you are using.  
**The text must be a complete representation of the text you are studying.  
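
As an illustration of the file type requirement, the sketch below assembles a corpus from a folder and keeps only files of a supported type. The function name `load_corpus`, the set `SUPPORTED_EXTENSIONS` and the choice of plain-text files are assumptions made for this example; check the documentation of your own text mining tool for the formats it actually supports.

```python
from pathlib import Path

# Assumption: the (hypothetical) text mining tool reads only these file types.
SUPPORTED_EXTENSIONS = {".txt"}


def load_corpus(folder):
    """Return a dict mapping file name to text for each supported file in folder."""
    corpus = {}
    for path in sorted(Path(folder).iterdir()):
        # Skip subfolders and files the tool cannot read.
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS:
            corpus[path.name] = path.read_text(encoding="utf-8")
    return corpus
```

Files in other formats, such as PDF or scanned images, are skipped here. They would first need to be converted to text, and the converted text checked for completeness.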

 

Creating corpora

Sometimes no existing corpus addresses your research question, so you will have to create your own. When preparing your corpus, it is worth looking at text preprocessing methods. Text preprocessing includes all the steps that are taken to make a text suitable for text analysis.
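
To give an idea of what text preprocessing can look like, here is a minimal sketch of three common steps: lowercasing, splitting the text into word tokens, and removing stop words. The function name `preprocess` and the short stop word list are assumptions for this example, and the simple pattern only matches unaccented Latin letters and digits. Which steps you need depends on your research question and on the tool you use.

```python
import re

# Assumption: a tiny illustrative stop word list; real projects use a fuller one.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}


def preprocess(text):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


print(preprocess("The corpus is a set of texts."))
# prints ['corpus', 'set', 'texts']
```

Punctuation disappears in this sketch because only letters and digits are matched. For texts in languages with accented characters, the pattern would need to be adapted.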