posted by Daniel Mayer
It occurs to me that the Text Mining industry tends to use quite a lot of jargon which can sometimes feel a bit intimidating. I know I was a bit bewildered the first time I heard some of the buzzwords, so I can imagine what it might feel like if you’re not dealing with this type of technology on a daily basis. So I thought it might be useful to shed some light on some of the basic underpinnings of our technology and clarify some of the terminology we use. You’ll see, it sometimes sounds impressive, but you already know a lot of it from your high school days. I expect this might be the first in a series of posts on related subjects, so feel free to comment either here or by email if there are some topics you’d like me to cover in this series.
So, let’s get started. While there are many methods for Text Mining, and we use a large range of them, for this post I’m going to focus on TEMIS’ roots in Semantic Annotation. Semantic Annotation is a method by which the computer analyzes text and, through the use of rules, understands the meaning of what is being said in that text. This happens in a series of steps, which can be seen as a staircase going from the ground (characters) up to higher levels of abstraction such as concepts.
The first step is Part-of-Speech Tagging. This is the low-level processing of text that performs three key operations:
- Tokenization: parsing sentences into tokens (words, punctuation marks, multi-word expressions, etc.). It sounds easy, but think of Chinese, where words are not separated by spaces. Now you get it.
- Morpho-syntactic analysis: determining the right part-of-speech category for each word: is it a verb, an adjective, a noun, a preposition, etc.? To do this it is often necessary to disambiguate between several possibilities. For example, the word ‘general’ can be either a noun or an adjective; its context helps disambiguate between the two.
- Syntactic analysis: deciding which grammatical role each token plays in the sentence. For example, which one is the subject, and which one is the complement?
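To make the first two operations concrete, here is a minimal sketch in Python. To be clear, the token pattern, the tiny lexicon, and the contextual rule are all illustrative assumptions on my part, not TEMIS’ actual implementation; real systems use far richer lexicons and statistical or rule-based disambiguation.

```python
import re

# Hypothetical toy tokenizer: words and punctuation become separate tokens.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tokenize(sentence):
    """Split a sentence into word and punctuation tokens."""
    return TOKEN_RE.findall(sentence)

# Hypothetical toy lexicon: ambiguous words map to several possible tags.
LEXICON = {
    "the": {"DET"},
    "a": {"DET"},
    "general": {"NOUN", "ADJ"},   # the ambiguity discussed above
    "spoke": {"VERB"},
    "strategy": {"NOUN"},
}

def tag(tokens):
    """Assign one part-of-speech tag per token, using one simple
    contextual rule: an ambiguous NOUN/ADJ word is tagged ADJ when
    the next word can be a noun, and NOUN otherwise."""
    tags = []
    for i, tok in enumerate(tokens):
        candidates = LEXICON.get(tok.lower(), {"X"})  # X = unknown
        if len(candidates) == 1:
            tags.append(next(iter(candidates)))
            continue
        nxt = tokens[i + 1].lower() if i + 1 < len(tokens) else None
        if nxt and "NOUN" in LEXICON.get(nxt, set()) and "ADJ" in candidates:
            tags.append("ADJ")   # "the general strategy" -> adjective
        elif "NOUN" in candidates:
            tags.append("NOUN")  # "the general spoke" -> noun
        else:
            tags.append(sorted(candidates)[0])
    return tags
```

Run on “The general spoke.”, this tags ‘general’ as a noun; run on “The general strategy…”, it tags it as an adjective — which is exactly the kind of context-driven disambiguation described above, just in drastically simplified form.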
An additional key operation is lemmatization, which consists in associating a normalized form with each token. For example, the word “ate” can essentially be viewed as a variant of the verb “to eat”. Determining the associated lemma “eat” makes it possible to write rules against the lemma, which is much more efficient than handling every inflected form case by case. More on this later.
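A lemmatizer can be sketched as an exception table for irregular forms plus fallback rules for regular inflections. The table and suffix rules below are a deliberately crude illustration I made up for this post; production lemmatizers rely on full morphological lexicons.

```python
# Hypothetical exception table for irregular English forms.
IRREGULAR = {
    "ate": "eat",
    "went": "go",
    "mice": "mouse",
}

def lemmatize(word):
    """Map an inflected word form to its lemma."""
    w = word.lower()
    if w in IRREGULAR:
        return IRREGULAR[w]
    # Crude regular-inflection stripping (illustrative only).
    if w.endswith("ies"):
        return w[:-3] + "y"          # cities -> city
    if w.endswith("ed") and len(w) > 4:
        return w[:-2]                # walked -> walk
    if w.endswith("s") and not w.endswith("ss"):
        return w[:-1]                # eats -> eat
    return w
```

With lemmas in hand, a single rule mentioning “eat” covers “eat”, “eats”, “ate”, and so on, which is the efficiency gain mentioned above.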
To finish off on Part-of-Speech Tagging, I have to mention Language Identification. I don’t want to sound like a marketing brochure, but our part-of-speech tagger is capable of recognizing 39 languages, which helps when you’re doing anything multilingual, like analyzing a stream of comments on Twitter.
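To give a flavor of how language identification can work, here is a toy version based on stop-word overlap. The word lists are my own illustrative assumption; a real identifier (such as one covering 39 languages) would typically use character n-gram statistics rather than a handful of words.

```python
# Hypothetical mini stop-word lists for three languages.
STOPWORDS = {
    "en": {"the", "and", "of", "is", "in", "to"},
    "fr": {"le", "la", "et", "de", "est", "dans"},
    "de": {"der", "die", "und", "ist", "im", "zu"},
}

def identify_language(text):
    """Return the language whose stop words best cover the text."""
    words = set(text.lower().split())
    scores = {lang: len(words & stops) for lang, stops in STOPWORDS.items()}
    return max(scores, key=scores.get)
```

Feeding it “the cat is in the garden” yields English, while “le chat est dans le jardin” yields French; with such short texts and tiny word lists this is fragile, which is precisely why real systems use statistical models.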
OK, I think this is enough for a first post on the subject. I hope the above is clear enough. Like I said earlier, if you have any questions about this, don’t hesitate; I’ll be happy to delve further into any of the above. In my next post on this topic, I’ll discuss Entity Extraction, which is the second step in the Semantic Annotation process.