Text mining

Corpora, databases, tools, methods and further reading.

Methods

To give an impression of the variety of text mining methods and their possibilities, this page lists some of the more common ones:

In general, text mining methods can be rule-based or based on machine learning. Many of the methods below can be approached in both ways. 

  • Rule-based methods rely on fixed, pre-defined rules to generate results, for instance based on pre-defined word lists and strict if-X-then-Y-statements. These methods can be more transparent, but may not be flexible enough for all purposes.
  • Machine learning methods use training algorithms to discover patterns in a labeled training set of texts, which can then be applied to new texts. For instance, a machine learning model may 'learn' to recognize personal names based on their context, form, and usage, even if these specific names were not present in the original training set. Machine learning methods can be very flexible and powerful, but are less predictable and transparent than rule-based methods and may require more effort to prepare.

Text preparation methods (preprocessing)

Although not strictly a research method in itself, text preprocessing is essential for text mining. It includes all the steps that are taken to make a text suitable for text analysis.

Common preprocessing methods are:

  • Converting files to a plain text format.
  • Transforming text files into word lists (the 'bag-of-words' approach).
  • Removing punctuation marks and/or capitalization.
  • Removing stop words (i.e. function words with relatively little semantic content, such as 'the', 'which', 'and').
  • Lemmatization: mapping words to their lemma (dictionary form) (e.g. 'breaking', 'breaks', 'broke' --> 'break').
  • Stemming: reducing words to their root or stem (e.g. 'capability', 'capableness', 'capably' --> 'capabl').

Programming languages can be helpful here. Popular libraries are NLTK (Python) and tm (R), but there are many more.
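The preprocessing steps above can be sketched in plain Python. This is only an illustration: the stop word list is a tiny hand-made one, and the suffix-stripping stemmer is deliberately crude compared to what a library such as NLTK provides.

```python
import string

# A tiny hand-made stop word list; real projects would use a fuller
# list, such as the one shipped with NLTK.
STOP_WORDS = {"the", "a", "an", "and", "which", "of", "in", "is", "it"}

def preprocess(text):
    # Remove capitalization and punctuation marks.
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Transform the text into a word list (the 'bag-of-words' approach).
    words = text.split()
    # Remove stop words.
    return [w for w in words if w not in STOP_WORDS]

def crude_stem(word):
    # A deliberately simple stemmer: strip a few common suffixes.
    # Real stemmers (e.g. NLTK's PorterStemmer) use more elaborate rules.
    for suffix in ("ness", "ing", "ly", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

words = preprocess("The cat, which is sleeping, ignores the barking dogs.")
stems = [crude_stem(w) for w in words]
```

After these steps, the example sentence is reduced to the content-word stems 'cat', 'sleep', 'ignore', 'bark', 'dog' — a form that is much easier to count and compare than the raw text.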

Word and text metrics

Frequently, text analysis methods use statistics about words and texts to discover and explore linguistic patterns.

Some frequently used text facts and metrics:

  • Word frequency: Which words occur often in a text or corpus?
  • Distinctive words: Which words are distinctive for a text compared to other texts within a corpus?
  • Word dispersion: Where does a word appear within a text, and how evenly distributed is a word within a text or corpus?
  • Word co-occurrence: Which words appear frequently together? How strong is this correlation?
  • Concordance: In what contexts (surrounding words) does a word appear?
  • Lexical density: How many unique words are used within a text compared to the total number of words (i.e.: how linguistically rich is the language)?
  • Average sentence length: How long is the average sentence of a text? How does this compare to other texts?

These questions may be answered using a programming language such as Python or R (for instance, Python's NLTK library), but many out-of-the-box tools, such as Voyant or AntConc, can do these things as well.
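Several of these metrics need nothing more than Python's standard library. A rough sketch, using a made-up miniature 'corpus' for illustration:

```python
from collections import Counter

text = "the cat sat on the mat and the dog sat on the cat".split()

# Word frequency: which words occur often in the text?
freq = Counter(text)

# Lexical density: unique words divided by the total number of words.
lexical_density = len(set(text)) / len(text)

# Concordance: the surrounding words each time a target word appears.
def concordance(words, target, window=2):
    hits = []
    for i, w in enumerate(words):
        if w == target:
            hits.append(" ".join(words[max(0, i - window):i + window + 1]))
    return hits
```

For example, `freq.most_common(3)` lists the three most frequent words, and `concordance(text, "dog")` returns the context window 'and the dog sat on'. Tools such as Voyant and AntConc compute the same kinds of statistics behind a graphical interface.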

Sentiment analysis

Sentiment analysis is the classification of texts according to the emotions and attitudes they express. For instance, texts may be classified as 'positive', 'neutral', or 'negative', or receive a score to indicate how strongly positive or negative they are. Other methods can specify the particular emotions represented, or classify texts as being more subjective or objective.

These types of analyses could be helpful in researching attitudes towards certain events, people, companies or topics in, for example, news items, review texts, or social media posts.

Sentiment analysis is frequently done using programming languages such as Python or R. Some commonly used libraries/packages are NLTK (Python), Syuzhet (R) and SentimentAnalysis (R).

Some ready-to-use text and data mining software, such as Orange and RapidMiner, also includes sentiment analysis functionality.
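In its simplest, rule-based form, sentiment analysis sums the scores of opinion words found in a text. The sketch below uses a tiny made-up lexicon; real lexicons (such as those behind Syuzhet or NLTK's VADER) contain thousands of scored words and handle negation, intensifiers, and more.

```python
# A tiny made-up sentiment lexicon, for illustration only.
LEXICON = {"great": 1, "good": 1, "love": 1,
           "bad": -1, "terrible": -1, "boring": -1}

def sentiment_score(text):
    # Sum the scores of all lexicon words that appear in the text.
    words = text.lower().split()
    return sum(LEXICON.get(w, 0) for w in words)

def classify_sentiment(text):
    # Map the numeric score onto the three classes mentioned above.
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

The numeric score can also be kept as-is to indicate how strongly positive or negative a text is, rather than collapsing it into three classes.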

Text classification & topic modelling

Text mining methods can be used to classify texts into certain categories, based on the words and language used in them. For instance, one could assign news items to categories such as 'politics', 'culture', 'economy', 'sports', etc. This can be done using rule-based methods (e.g. by counting specific words) or machine learning methods (using training data labelled with the relevant categories).

Similar, but not quite the same, is topic modelling, a machine learning method that categorises and clusters texts into different groups. It identifies clusters of words that often appear together and that may be thought of as belonging to a coherent 'topic'. Every text is then labelled according to the topics that are present (often with a percentage that corresponds to how strongly the topic is represented in the text). This can help researchers find subject clusters and text categories within a collection of texts, and identify texts that are semantically similar.
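A rule-based classifier of the word-counting kind can be sketched as follows. The keyword lists per category are made up for illustration; a real system would use much larger, curated lists, or a machine learning model trained on labelled examples.

```python
# Made-up keyword lists per news category, for illustration only.
CATEGORY_KEYWORDS = {
    "politics": {"election", "parliament", "minister", "vote"},
    "sports": {"match", "goal", "tournament", "player"},
    "economy": {"market", "inflation", "shares", "bank"},
}

def classify_news(text):
    words = set(text.lower().split())
    # Count keyword hits per category and pick the best-scoring one.
    scores = {cat: len(words & kws) for cat, kws in CATEGORY_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"
```

Topic modelling differs in that the word clusters ('topics') are not defined in advance like these keyword lists, but discovered by the algorithm itself from the texts.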

Named entity recognition (NER)

Named Entity Recognition (short: NER) is the task of automatically identifying and extracting named entities from texts. Examples of named entities are persons, cities, countries, organisations, brands, and more. Named Entity Recognition can either be rule-based, depending on predefined word (i.e. name) lists, or based on a machine learning model that is trained on a set of texts with labelled entities.

NER is often done with the help of programming languages. Some relevant libraries in Python are spaCy, flair and NLTK; in R, popular packages are spacyr and openNLP. A ready-to-use NER tool is the Stanford Named Entity Recognizer.
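The rule-based variant mentioned above, which depends on predefined name lists (sometimes called gazetteers), can be sketched very simply. The lists here are tiny and made up; real systems use far larger gazetteers or trained models such as spaCy's.

```python
# A rule-based NER sketch using predefined name lists (gazetteers).
# These lists are made up for illustration only.
GAZETTEER = {
    "PERSON": {"Marie Curie", "Alan Turing"},
    "CITY": {"Paris", "Amsterdam"},
    "ORGANISATION": {"UNESCO"},
}

def find_entities(text):
    # Report every listed name that occurs in the text, with its label.
    found = []
    for label, names in GAZETTEER.items():
        for name in names:
            if name in text:
                found.append((name, label))
    return sorted(found)

entities = find_entities("Marie Curie moved to Paris before UNESCO existed.")
```

Note the limitation of this approach: a name that is not on the list (or an ambiguous word like 'Paris' as a person's name) will be missed or mislabelled, which is exactly where machine learning models that use context have the advantage.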

Part-of-speech tagging

Part-of-speech tagging (or: POS tagging) means identifying and tagging the grammatical category of words within a text (e.g.: noun, adjective, verb, preposition, etc.).

Automatic part-of-speech tagging requires advanced forms of natural language processing, because often the grammatical category of a word can only be identified by interpreting the context and the syntactical structure of a sentence. For instance, 'mine' can be a verb, a noun, or a possessive pronoun, depending on the context.

POS tagging is often done to enable more advanced linguistic analysis of a text corpus, or to enable using part-of-speech information to extract certain words and statistics from a corpus.

Part-of-speech tagging can be done using programming languages such as Python or R, but ready-to-use tools are also available for certain languages.
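The 'mine' example above can be made concrete with a toy tagger. It assigns each word a default tag from a tiny hand-made lexicon, then applies two contextual rules to disambiguate 'mine'. Real taggers, such as the ones in NLTK or spaCy, use far richer statistical models, but the principle of using context is the same.

```python
# A tiny hand-made lexicon of default tags, for illustration only.
LEXICON = {"the": "DET", "this": "DET", "they": "PRON", "is": "VERB",
           "coal": "NOUN", "book": "NOUN", "mine": "NOUN"}

def pos_tag(words):
    tags = [LEXICON.get(w, "UNK") for w in words]
    for i, w in enumerate(words):
        if w == "mine":
            # Contextual rules: after a verb like 'is', 'mine' is a
            # possessive pronoun; after a subject pronoun, it is a verb.
            if i > 0 and tags[i - 1] == "VERB":
                tags[i] = "PRON"
            elif i > 0 and tags[i - 1] == "PRON":
                tags[i] = "VERB"
    return list(zip(words, tags))
```

With these rules, 'the coal mine' keeps 'mine' as a noun, 'this book is mine' tags it as a pronoun, and 'they mine coal' tags it as a verb.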