
What is Text Mining?

What is text mining? What are its potential applications and limitations?

Text mining is about looking for patterns in natural language text, and may be defined as the process of analyzing text to extract information from it for particular purposes. It recognizes that complete understanding of natural language text, a long-standing goal of computer science, is not immediately attainable and focuses on extracting a small amount of information from text with high reliability.

Text Mining is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources. It is different from what we’re familiar with in web search. In search, the user is typically looking for something that is already known and has been written by someone else. The problem is pushing aside all the material that currently isn’t relevant to your needs in order to find the relevant information.

To get further, though, we need more sophisticated language analysis. A number of us are working on statistical techniques that try to assign semantics, or meaning, to parts of the text. We break off pieces of the analysis problem, targeted at particular applications, rather than trying to “read” the articles as a whole. This approach is especially promising in the biosciences because of the nature of the text itself. In some ways it is easier to process automatically than ordinary text: it is less ambiguous, and the processes it describes are somewhat mechanical and so representable in a computer.

The fundamental limitations of text mining are, first, that we will not be able to write programs that fully interpret text for a very long time, and second, that the information one needs is often not recorded in textual form. If I tried to write a program that detected when and where a new word came into existence and how it spread by analyzing web pages, I would miss important clues relating to usage in spoken conversations, email, on the radio and TV, and so on. Similarly, if I tried to write a program that processes published documents in order to guess what will happen to a bill in Washington DC, I would fail because most of the action still happens in negotiations behind closed doors.
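As a rough illustration of the word-tracking program described above, here is a minimal Python sketch. It assumes a hypothetical corpus of dated web pages (the `corpus` list and the `first_seen` helper are invented for this example) and reports the earliest date on which a word appears. It also shows the limitation in practice: the answer can only be as early as the earliest written page, so any earlier spoken or private usage stays invisible.

```python
import re
from datetime import date


def first_seen(word, dated_docs):
    """Return the earliest date on which `word` appears in the corpus, or None."""
    # Match the word as a whole token, ignoring case.
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    dates = [d for d, text in dated_docs if pattern.search(text)]
    return min(dates) if dates else None


# Hypothetical toy corpus: (publication date, page text).
corpus = [
    (date(2004, 3, 1), "Nobody here writes about web logs yet."),
    (date(2005, 6, 12), "I started a podcast last week."),
    (date(2005, 1, 20), "The word podcast is showing up everywhere."),
]

print(first_seen("podcast", corpus))  # 2005-01-20
```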

 

Information sources:

This time we continue our analysis of Human Language Technologies (HLT). We focus on some of the research centres mentioned earlier and look at what their work consists of.

To begin with, we could mention the German Research Center for Artificial Intelligence, which pursues these themes in research, development and commercial projects. Its work consists of:

  1. exploiting – and automatically extending – ontologies for content processing
  2. tighter integration of shallow and deep techniques in processing
  3. enriching deep processing with statistical methods
  4. combining language checking with structuring tools in document authoring
  5. document indexing for German and English
  6. automatically associating recognized information with related information and thus building up collective knowledge
  7. automatically structuring and visualizing extracted information
  8. processing information encoded in multiple languages, among them Chinese and Japanese

On the other hand, we could talk about the HKUST Human Language Technology Center in Hong Kong. It specializes in:

  1. speech and signal processing
  2. statistical and corpus-based natural language processing
  3. machine translation
  4. text mining
  5. information extraction
  6. Chinese language processing, knowledge management, and related fields

Special emphasis is given to machine processing of the Chinese language and Chinese information.

Finally, we should talk about one of the most important centres, the Austrian Research Institute for Artificial Intelligence. Its work covers:

  1. Typed unification-based grammar formalisms 
  2. Development of an HPSG-based grammar for German
  3. Natural Language Generation 
  4. Speech Synthesis 
  5. Computational Morphology

 Information sources:

Today we will again analyse the differences and the mistakes between a machine-translated text and a human-written one. We can see how these programs work and how they can help us in many cases.

This time we have a text written in Portuguese that is translated into Spanish. There are errors, but in general the text is quite good. The main errors involve accented words, articles (which are slightly different in Portuguese) and some verbs, but overall the result is rather good.

Albert Einstein foi um físico alemão radicado nos Estados Unidos mais conhecido por desenvolver a teoria da relatividade. Ganhou o Prémio Nobel da Física de 1921; no entanto, o prémio só foi anunciado em 1922. Em 2005 celebrou-se o Ano Internacional da Física, em comemoração dos 100 anos do chamado “Annus Mirabilis”  de Einstein, em que este publicou quatro dos mais importantes artigos cientifícos da física do século XX. Em sua honra, foi atribuído o seu nome a uma unidade usada na fotoquímica, o einstein, bem como a um elemento químico, o einstênio.

Albert Einstein fue un físico alemán radicado en Estados Unidos más conocido por desarrollar la teoría de la relatividad. Ganó el Prémio Nobel de la Física de 1921; sin embargo, el prémio sólo fue anunciado en 1922. En 2005 se celebró el Año Internacional de la Física, en celebración de los 100 años del llamado “Annus Mirabilis”  de Einstein, en que este publicó cuatro de los más importantes artículos cientifícos de la física del siglo XX. En su honra, fue atribuido el su nombre a una unidad usada en la fotoquímica, el einstein, bien como el un elemento químico, el einstênio.

On the other hand, if we do the same with a text written in English, we will see that the difference is much bigger and the result worse. Here is a text in English, followed by its machine translation:

Albert Einstein  was a German-born theoretical physicist. He is best known for his theory of relativity and specifically mass–energy equivalence, E = mc2. Einstein received the 1921 Nobel Prize in Physics “for his services to Theoretical Physics, and especially for his discovery of the law of the photoelectric effect.”

Albert Einstein era un físico teórico de forma alemana nacido. Es más conocido para|por su teoría de la relatividad y específicamente equivalencia de energía masiva, E = mc2. Einstein recibía al 1921 Nobel Prize en Física para|por sus servicios a Física Teórica, y especialmente para|por su descubrimiento de la ley del effect.

Now we see that the translation is considerably worse. Errors are common, as are places where the program leaves more than one alternative (such as “para|por”), and the result is not good enough to use as it stands. There are wrong words, wrong expressions, words that have not been translated, and so on.

 

 

Information sources:

This time we will define and discuss the differences between the following specialized terms: machine translation, machine-aided translation, multilingual content management and translation technology. As we will see, they are different concepts that can help us understand a bit more about what the new technologies mean.

Machine translation, sometimes referred to by the abbreviation MT, is a sub-field of computational linguistics that investigates the use of computer software to translate text or speech from one natural language to another. At its basic level, MT performs simple substitution of words in one natural language for words in another. Using corpus techniques, more complex translations may be attempted, allowing for better handling of differences in linguistic typology, phrase recognition, and translation of idioms, as well as the isolation of anomalies.
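To make the “simple substitution of words” level concrete, here is a minimal Python sketch. The tiny English-to-Spanish lexicon and the `substitute` function are hypothetical and invented for illustration; real MT systems are far more elaborate. The sketch replaces each word independently and leaves unknown words untouched, which is one way untranslated words end up in the output.

```python
# Hypothetical toy English-to-Spanish lexicon, for illustration only.
LEXICON = {"the": "el", "love": "amor", "is": "es", "blind": "ciego"}


def substitute(sentence, lexicon):
    """Translate word by word; words missing from the lexicon pass through unchanged."""
    out = []
    for word in sentence.lower().split():
        out.append(lexicon.get(word, word))
    return " ".join(out)


print(substitute("Love is blind", LEXICON))           # amor es ciego
print(substitute("The love is considered", LEXICON))  # el amor es considered
```

Because each word is handled in isolation, nothing in this approach deals with word order, agreement or idioms, which is where the corpus techniques mentioned above come in.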

Content Management (CM) systems contain information, mostly in the form of more or less structured text documents, but potentially also including audio clips, video clips and images. Minimally, such a system provides mechanisms for storage and retrieval of content data, but it may also give support for indexing of documents, distributed document editing, version management, and generation of different views and guided tours. In our global society, it is inevitable that content is managed in several languages. In particular, there is often a need to maintain versions in different languages of what is, from a content point of view, essentially one document. The creation and maintenance of such documents are the core tasks of a multilingual CM system (MCM system).

In Computer-Aided Translation, or more precisely Machine-Aided Human Translation (MAHT), by contrast, translation is performed by a human, and the computer offers supporting tools. The intended users are competent translators working in teams and linked through a local network. Each translator’s workstation offers tools to:

  • access a bilingual terminology.
  • access a translation memory.
  • submit parts of the text to an MT server.

These tools have to be completely integrated in the text processor. The software automatically analyzes the source text and attaches keyboard shortcuts to the terms and sentences found in the terminological database and in the translation memory. One very important design decision is whether to offer a specific text processor, as in IBM’s Translation Manager, or whether to use directly one or more text processors produced by third parties, as in EuroLang Optimizer.
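The translation-memory lookup mentioned above can be sketched in a few lines of Python. This is only an assumption about how such a lookup might work, not a description of any particular product: the `MEMORY` dictionary, the `lookup` function and the 0.75 similarity cutoff are all invented for the example, and the fuzzy matching uses the standard library's `difflib`.

```python
import difflib

# Hypothetical translation memory: previously translated source segments
# mapped to the translations a human produced for them.
MEMORY = {
    "Press the start button.": "Pulse el botón de inicio.",
    "Close the cover.": "Cierre la tapa.",
}


def lookup(segment, memory, cutoff=0.75):
    """Return (stored source, stored translation, similarity) for the closest
    stored segment, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(segment, list(memory), n=1, cutoff=cutoff)
    if not matches:
        return None
    source = matches[0]
    score = difflib.SequenceMatcher(None, segment, source).ratio()
    return source, memory[source], score


hit = lookup("Press the stop button.", MEMORY)
if hit is not None:
    source, translation, score = hit
    print(f"Closest stored segment: {source!r}")
    print(f"Suggested translation:  {translation!r} (similarity {score:.2f})")
```

The suggestion is only a starting point: the human translator still has to adapt the stored translation to the new sentence.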

Technology that provides live translation of speech from one language to another has been revealed by scientists from the US and Europe. The speech translation software developed by the InterACT researchers backs up its use of speech recognition and voice synthesis with statistical techniques to speed up the selection of words and phrases. These techniques are based on scans of a vast number of previously translated documents in order to build probabilistic rules for translation.
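The idea of building probabilistic rules from previously translated documents can be illustrated with a very crude Python sketch. It is not the method used by the InterACT researchers, whose details are not given here; it simply counts how often each target word co-occurs with each source word in a hypothetical sentence-aligned corpus (`PAIRS`) and turns the counts into relative frequencies.

```python
from collections import Counter, defaultdict

# Hypothetical sentence-aligned corpus of (English, Spanish) pairs.
PAIRS = [
    ("the house", "la casa"),
    ("the book", "el libro"),
    ("a house", "una casa"),
]


def cooccurrence_probs(pairs):
    """Estimate P(target word | source word) from raw co-occurrence counts."""
    counts = defaultdict(Counter)
    for src, tgt in pairs:
        for s in src.split():
            for t in tgt.split():
                counts[s][t] += 1
    probs = {}
    for s, c in counts.items():
        total = sum(c.values())
        probs[s] = {t: n / total for t, n in c.items()}
    return probs


probs = cooccurrence_probs(PAIRS)
best = max(probs["house"], key=probs["house"].get)
print(best)  # casa
```

With more translated text, the words that genuinely correspond tend to co-occur more often than chance pairings, which is the intuition behind the statistical approach.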

 

Information sources:

With the aim of looking more deeply into Human Language Technologies, we will analyse three of the most important European centres, comment on them and point out the differences among them.

First of all, we should mention the National Centre for Language Technology. It states that language is the key modality in communication. Its mission is to conduct research into the processing of human language by computers, including speech recognition and synthesis, machine translation, human-computer interfaces, information retrieval and extraction, the teaching and learning of languages using computers, and software localisation and globalisation. The centre carries out basic research and also develops applications.

On the other hand, the Centre for Human Language Technology and Bioinformatics (HULTIG) of the University of Beira Interior was founded by Prof. Gaël Dias to carry out research in the areas of Human Language Technology and Bioinformatics, gathering researchers from Computer Science, Statistics, and Linguistics. HULTIG currently integrates 3 PhD researchers, 6 PhD students, 2 Masters, 1 Bachelor, and 7 Bachelor students. It currently focuses its research on the following areas: automatic text summarization, topic segmentation, sentence compression, medical thesauri and dictionaries, automatic construction of ontologies, word sense disambiguation, text and mobile technologies, data warehouses, web content mining, efficient algorithms for Natural Language Processing, and string alignment.

Finally, the German Research Center for Artificial Intelligence aims to improve language technology through novel computational techniques for processing text, speech and knowledge, a deeper understanding of human language and thought, and the study of the true needs of the end user and the demands of the market. It develops novel and improved applications in three areas: information and knowledge management, document production, and natural communication.

To sum up, these centres have many things in common. They work both on their own and in associations or groups.

 

Information sources:

Translation examples. (Q3)

Nowadays there are many different translation programs which can, in many cases, help us to solve our language problems. They are really useful but, at the same time, they can give us wrong information and get us into trouble. Now we will see how a text written in Spanish is translated into English. The first English paragraph is the machine translation, with its mistakes and wrong words; the second is a correct English version for comparison.

“El amor es considerado como un conjunto de comportamientos y actitudes, incondicionales y desinteresadas, que se manifiestan entre seres capaces de desarrollar inteligencia emocional o emocionalidad. El amor no sólo está circunscrito al género humano sino también a todos aquellos seres que puedan desarrollar nexos emocionales con otros.”

“The love is considered like a set of behaviors and attitudes, unconditional and disinterested, that are pronounced between beings able to develop to emotional intelligence or emocionalidad. The love is not only circumscribed to the human sort but also all those beings who can develop emotional nexuses with others.”

“Love is any of a number of emotions and experiences related to a sense of strong affection. The word love can refer to a variety of different feelings, states, and attitudes, ranging from generic pleasure to intense interpersonal attraction. This diversity of meanings, combined with the complexity of the feelings involved, makes love unusually difficult to consistently define, even compared to other emotional states.”

We have just seen how big the difference is between English and Spanish, languages that have far less in common. On the other hand, if we translate the same text into Portuguese, we can see that it is much better translated, because Spanish and Portuguese are languages that have a lot in common.

“A palavra amor presta-se a múltiplos significados na língua portuguesa. Pode significar afeição, compaixão, misericórdia, ou ainda, inclinação, atração, apetite, paixão, querer bem, satisfação, conquista, desejo, libido, etc. O conceito mais popular de amor envolve, de modo geral, a formação de um vínculo emocional com alguém, ou com algum objeto que seja capaz de receber este comportamento amoroso e alimentar as estimulações sensoriais e psicológicas necessárias para a sua manutenção e motivação.”

“A palavra amor empresta-se a múltiplos significados na língua portuguesa. Pode significar afeição, compaixão, misericórdia, ou ainda, inclinação, atração, apetite, paixão, querer bem, satisfação, conquista, desejo, libido, etc. Ou conceito mais popular de amor envolve, de modo geral, a formação de um vínculo emocional com alguém, ou com algum objeto que seja capaz de receber este comportamento amoroso e alimentar as estimulações sensoriais e psicológicas necessárias para a sua manutenção e motivação.”

As we can see, the difference is quite big. We should take care over what we write and how we write it, in order not to make mistakes. In many cases these programs are really useful and necessary, but we should double-check their output before relying on it. This article has shown how important it is to use the right program in order to get the best result.

 

 

Information sources:

Today we are going to discuss the main characteristics of a translation task according to the FEMTI report. To do so, we will consult different pages on the net and other sources.

First of all, we have to know that, “The Framework for Machine Translation Evaluation in ISLE is a resource that helps MT evaluators define contextual evaluation plans”. FEMTI consists of two interrelated classifications or taxonomies: the first one lists possible characteristics of the contexts of use that are applicable to MT systems. The second one lists the possible characteristics of an MT system, along with the metrics that were proposed to measure them.

Secondly, we should talk about the main characteristics of a translation task according to the FEMTI report.

  1. Assimilation: “The ultimate purpose of the assimilation task (of which translation forms a part) is to monitor a (relatively) large volume of texts produced by people outside the organization, in (usually) several languages.”
  2. Dissemination: “The ultimate purpose of dissemination is to deliver to others a translation of documents produced inside the organization.”
  3. Communication: “The ultimate purpose of the communication task is to support multi-turn dialogues between people who speak different languages. The translation quality must be high enough for painless conversation, despite possible syntactically ill-formed input and idiosyncratic word and format usage.”

 

 

Information sources:
