What is text mining? What are its potential applications and limitations?
Text mining is about looking for patterns in natural language text, and may be defined as the process of analyzing text to extract information from it for particular purposes. It recognizes that complete understanding of natural language text, a long-standing goal of computer science, is not immediately attainable and focuses on extracting a small amount of information from text with high reliability.
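To make "extracting a small amount of information from text with high reliability" concrete, here is a minimal sketch in Python. The target (e-mail addresses), the pattern, and the function name `extract_emails` are assumptions chosen for illustration; none of them come from the sources below. The pattern is deliberately strict: it trades coverage for precision, which is the spirit of the definition above.

```python
import re

# Hypothetical narrow target: e-mail addresses. The pattern is deliberately
# strict, so it may miss unusual addresses but rarely returns a false match.
EMAIL = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")


def extract_emails(text):
    """Return e-mail-like strings in order of first appearance, lowercased
    and without duplicates."""
    seen = set()
    found = []
    for match in EMAIL.finditer(text):
        address = match.group(0).lower()
        if address not in seen:
            seen.add(address)
            found.append(address)
    return found


sample = "Contact Alice at alice@example.org or Bob (BOB@Example.org, bob@example.org)."
emails = extract_emails(sample)
print(emails)
```

The program makes no attempt to understand the sentence; it only recognizes one narrow pattern, which is why it can do so reliably.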
Text Mining is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources. It is different from what we’re familiar with in web search. In search, the user is typically looking for something that is already known and has been written by someone else. The problem is pushing aside all the material that currently isn’t relevant to your needs in order to find the relevant information.
To get further, though, we need more sophisticated language analysis. A number of us are working on statistical techniques that try to assign semantics, or meaning, to parts of the text. We break off pieces of the analysis problem, targeted at particular applications, rather than trying to “read” the articles as a whole. This approach is especially promising in the biosciences because of the nature of the text itself. In some ways bioscience text is easier to process automatically than ordinary text: it is less ambiguous, and the processes it describes are somewhat mechanical and so representable in a computer.
The fundamental limitations of text mining are, first, that we will not be able to write programs that fully interpret text for a very long time, and second, that the information one needs is often not recorded in textual form. If I tried to write a program that detected when and where a new word came into existence and how it spread by analyzing web pages, I would miss important clues relating to usage in spoken conversations, email, on the radio and TV, and so on. Similarly, if I tried to write a program that processes published documents in order to guess what will happen to a bill in Washington, DC, I would fail because most of the action still happens in negotiations behind closed doors.
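The new-word example can be sketched in a few lines of Python, which also makes the second limitation visible. The input layout (tuples of date, source, and text), the function name `first_sightings`, and the sample pages are all hypothetical. The program can only report the earliest sighting within the text it is given; a word first used in conversation or on the radio would still appear to "begin" on whatever page happened to be collected.

```python
from datetime import date


def first_sightings(documents):
    """Map each word to the (date, source) of its earliest appearance.

    `documents` is an iterable of (date, source, text) tuples; this layout
    is an assumption made for the example.
    """
    earliest = {}
    for seen_on, source, text in documents:
        for token in text.lower().split():
            # Crude tokenization: strip surrounding punctuation only.
            word = token.strip(".,;:!?\"'()")
            if not word:
                continue
            if word not in earliest or seen_on < earliest[word][0]:
                earliest[word] = (seen_on, source)
    return earliest


# Hypothetical pages, intentionally listed out of date order.
pages = [
    (date(2008, 5, 5), "b.example", "Everyone is blogging now."),
    (date(2003, 1, 1), "a.example", "I started blogging today."),
]
sightings = first_sightings(pages)
print(sightings["blogging"])
```

Tracking how the word spread would require keeping every sighting rather than only the earliest one, but the blind spot remains the same: usage that was never written down is invisible to the program.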
Information sources:
- What Is Text Mining? By Marti Hearst, SIMS, UC Berkeley. (2003). Retrieved: 11:48, May 5, 2008, from http://people.ischool.berkeley.edu/~hearst/text-mining.html
- Untangling Text Data Mining. By Marti A. Hearst, School of Information Management & Systems. (1999). Retrieved: 11:58, May 5, 2008, from http://people.ischool.berkeley.edu/~hearst/papers/acl99/acl99-tdm.html
- What is Text Mining? (2002). Retrieved: 12:06, May 5, 2008, from http://www.cs.waikato.ac.nz/~nzdl/textmining/