To give an impression of the variety of text mining methods and their possibilities, this page lists some of the more common text mining methods:
In general, text mining methods can be rule-based or based on machine learning. Many of the methods below can be approached in both ways.
Although not strictly a research method in itself, text preprocessing is essential for text mining. It includes all the steps that are taken to make a text suitable for text analysis.
Common preprocessing methods are:
Again, programming languages can be helpful. Popular libraries are NLTK (Python) and tm (R), but there are many more.
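As a minimal illustration of what such preprocessing involves, the sketch below uses only Python's standard library; in practice, NLTK or tm provide proper tokenisers and curated stopword lists, and the stopword set here is a toy example:

```python
import re

# Toy stopword list for illustration only; NLTK ships much larger,
# language-specific lists.
STOPWORDS = {"the", "a", "an", "is", "of", "and"}

def preprocess(text):
    """Lowercase a text, tokenise it, and remove stopwords."""
    text = text.lower()                   # normalise case
    tokens = re.findall(r"[a-z]+", text)  # crude tokenisation: runs of letters
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("The Analysis of a Text!"))  # → ['analysis', 'text']
```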
Frequently, text analysis methods use statistics about words and texts to discover and explore linguistic patterns.
Some frequently used text statistics and metrics:
These metrics can be computed using a programming language such as Python or R (for instance, with Python's NLTK library), but many out-of-the-box tools, such as Voyant or AntConc, can do this as well.
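For example, a basic word-frequency count, one of the most common of these metrics, takes only a few lines of Python (tools such as Voyant and AntConc compute the same thing out of the box):

```python
from collections import Counter

# Count how often each word occurs in a (toy) text.
text = "to be or not to be"
counts = Counter(text.split())

print(counts.most_common(2))  # → [('to', 2), ('be', 2)]
```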
Sentiment analysis is the classification of texts according to the emotions and attitudes they express. For instance, texts may be classified as 'positive', 'neutral', or 'negative', or receive a score to indicate how strongly positive or negative they are. Other methods can specify the particular emotions represented, or classify texts as being more subjective or objective.
These types of analyses could be helpful in researching attitudes towards certain events, people, companies or topics in, for example, news items, review texts, or social media posts.
Sentiment analysis is frequently done using programming languages such as Python or R. Commonly used libraries/packages include NLTK (Python), Syuzhet (R), and SentimentAnalysis (R).
Some ready-to-use text and data mining software, such as Orange and RapidMiner, also includes sentiment analysis functionality.
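A rule-based variant of sentiment analysis can be sketched with a tiny hand-made lexicon; the word lists below are purely illustrative toys, whereas real lexicons (such as those shipped with the libraries above) contain thousands of carefully scored words:

```python
# Toy sentiment lexicon, for illustration only.
POSITIVE = {"good", "great", "excellent", "love", "wonderful"}
NEGATIVE = {"bad", "terrible", "awful", "hate", "boring"}

def classify_sentiment(text):
    """Label a text 'positive', 'negative', or 'neutral' by counting lexicon hits."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(classify_sentiment("A great film with an excellent cast"))  # → positive
```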
Text mining methods can be used to classify texts into certain categories, based on the words and language used in them. For instance, one could assign news items to categories such as 'politics', 'culture', 'economy', 'sports', etc. This can be done using rule-based methods (e.g. counting occurrences of specific words) or machine-learning methods (using training data labelled with the relevant categories).
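The rule-based approach can be illustrated by counting category keywords; the keyword lists below are hypothetical toy examples, not taken from any real classifier:

```python
# Hypothetical keyword lists per category, for illustration only.
CATEGORIES = {
    "politics": {"election", "parliament", "minister"},
    "sports": {"match", "goal", "tournament"},
}

def classify(text):
    """Assign the category whose keywords overlap most with the text."""
    words = set(text.lower().split())
    return max(CATEGORIES, key=lambda c: len(CATEGORIES[c] & words))

print(classify("The minister called an early election"))  # → politics
```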
Similar, but not quite the same, is topic modelling, a machine learning method that categorises and clusters texts into different groups. It identifies clusters of words that often appear together and that may be thought of as belonging to a coherent 'topic'. Every text is then labelled according to the topics that are present (often with a percentage that corresponds to how strongly the topic is represented in the text). This can help researchers find subject clusters and text categories within a collection of texts, and identify texts that are semantically similar.
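To give a flavour of what topic modelling looks like in practice, here is a small sketch assuming scikit-learn is available (gensim is another popular choice); it fits a two-topic LDA model on four toy documents and returns, per document, a probability distribution over the topics:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Four toy documents: two about politics, two about sports.
docs = [
    "the election and the parliament debate",
    "the minister won the election vote",
    "the team scored a goal in the match",
    "the tournament final match was close",
]

# Build a document-term matrix, then fit a two-topic LDA model.
vec = CountVectorizer(stop_words="english")
dtm = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(dtm)  # one topic distribution per document

# Each row is a probability distribution over the two topics.
print(doc_topics.shape)  # → (4, 2)
```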
Named Entity Recognition (NER) is the task of automatically identifying and extracting named entities from texts. Examples of named entities are persons, cities, countries, organisations, brands, and more. Named Entity Recognition can either be rule-based, depending on predefined word (i.e. name) lists, or based on a machine learning model that is trained on a set of texts with labelled entities.
NER is often done with the help of programming languages. Relevant Python libraries include spaCy, flair and NLTK; in R, popular packages are spacyr and openNLP. A ready-to-use NER tool is the Stanford Named Entity Recognizer.
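The rule-based flavour described above can be sketched with a predefined name list (a 'gazetteer'); the entries below are toy examples, and a real system would need far larger lists plus disambiguation logic:

```python
# Toy gazetteer mapping entity strings to their types, for illustration only.
GAZETTEER = {
    "Amsterdam": "CITY",
    "Netherlands": "COUNTRY",
    "United Nations": "ORGANISATION",
}

def find_entities(text):
    """Return (entity, type) pairs for gazetteer entries found in the text."""
    return [(name, tag) for name, tag in GAZETTEER.items() if name in text]

print(find_entities("The United Nations met in Amsterdam."))
# → [('Amsterdam', 'CITY'), ('United Nations', 'ORGANISATION')]
```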
Part-of-speech tagging (or: POS tagging) means identifying and tagging the grammatical category of words within a text (e.g.: noun, adjective, verb, preposition, etc.).
Automatic part-of-speech tagging requires advanced forms of natural language processing, because the grammatical category of a word can often only be determined by interpreting the context and the syntactic structure of the sentence. For instance, 'mine' can be a verb, a noun, or a possessive pronoun, depending on the context.
POS tagging is often done to enable more advanced linguistic analysis of a text corpus, or to extract particular words and statistics from a corpus using part-of-speech information.
Part-of-speech tagging can be done using programming languages such as Python or R, but ready-to-use tools are also available for certain languages.
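To give a flavour of the task, a naive lookup-plus-suffix tagger might look like the sketch below; the lexicon and heuristics are toy examples, and real taggers (such as those in NLTK or spaCy) use trained statistical models and sentence context precisely because such simple rules fail on ambiguous words like 'mine':

```python
# Tiny illustrative lexicon; ambiguous words are why real taggers must
# consider sentence context rather than a lookup table.
LEXICON = {"the": "DET", "a": "DET", "dog": "NOUN", "cat": "NOUN"}

def tag(words):
    """Attach a crude part-of-speech tag to each word."""
    tags = []
    for w in words:
        if w in LEXICON:
            tags.append((w, LEXICON[w]))
        elif w.endswith("ly"):
            tags.append((w, "ADV"))   # crude suffix heuristic
        elif w.endswith("s"):
            tags.append((w, "VERB"))  # e.g. 'runs'; often wrong for plural nouns
        else:
            tags.append((w, "NOUN"))  # default guess
    return tags

print(tag(["the", "dog", "runs", "quickly"]))
# → [('the', 'DET'), ('dog', 'NOUN'), ('runs', 'VERB'), ('quickly', 'ADV')]
```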