In a previous post, I mentioned that taxonomies aren’t the be-all and end-all of semantic content enrichment. Sure, taxonomies are there each time you classify concepts, so you might be led to believe they’re also the main tool for content enrichment (maybe the only one?). Not quite. Let’s go over some of the benefits of taxonomies as well as some of their shortcomings.
The main benefit of using a taxonomy to enrich your content is convenience. It’s a simple method. You have a structured vocabulary, organized hierarchically. It represents your worldview. Oftentimes, you’ve been using it by hand over the years for indexing. It seems simple enough to apply automatically, and doesn’t require Jedi mind tricks or Natural Language Processing skills. But without the proper tools you might be disappointed with the results.
Some of the reasons are that vocabularies, however well defined, typically include ambiguous terms (which generate a lot of noise when their homonyms appear in text) and lack variants (which generates silence, because taxonomy concepts go unrecognized when they appear in alternative forms). That’s where our Smart Taxonomy Facilitator (STF) Skill Cartridges® come into play. They embed three technologies that help reduce these issues:
- A Part-of-Speech tagging layer that handles the most obvious disambiguation issues,
- Fuzzy Term Matching techniques that reduce silence by expanding the range of recognized forms, and
- Relevance Scoring that reduces noise by narrowing down the range of selected concepts.
With the benefit of these technologies, taxonomies become a great starting point for enriching your content.
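To make the noise and silence trade-off concrete, here is a minimal Python sketch of two of the ideas above: fuzzy matching to catch variants, and a simple score to drop weakly supported concepts. It is not how STF Skill Cartridges® work internally. The `annotate` function, the similarity threshold, and the use of raw occurrence counts as a stand-in for relevance scoring are all assumptions made for illustration, and the Part-of-Speech layer is left out entirely.

```python
import re
from collections import Counter
from difflib import SequenceMatcher


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def annotate(text, labels, threshold=0.8, min_count=2):
    """Return {label: occurrences} for taxonomy labels found in text.

    Illustrative assumptions, not a real product API:
    - fuzzy matching = character similarity ratio >= threshold
    - relevance scoring = keep labels seen at least min_count times
    """
    tokens = tokenize(text)
    counts = Counter()
    for label in labels:
        n = len(label.split())
        # Slide a window of the label's length (in words) over the text.
        for i in range(len(tokens) - n + 1):
            candidate = " ".join(tokens[i:i + n])
            similarity = SequenceMatcher(None, candidate, label).ratio()
            if similarity >= threshold:
                counts[label] += 1
    # Crude relevance filter: drop labels with too few occurrences.
    return {label: k for label, k in counts.items() if k >= min_count}


text = (
    "Neural networks are popular. A neural network learns from data. "
    "The bank of the river was flooded."
)
print(annotate(text, ["neural network", "bank"]))
```

In this toy text, an exact string lookup would miss the plural "neural networks" (silence), while the fuzzy comparison picks it up. The single, off-topic mention of "bank" falls below the occurrence threshold and is discarded (noise). Real relevance scoring would weigh far more than raw counts, but the principle is the same.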
But they still have limitations. The first is that you will never find concepts that are not in your taxonomy. That sounds obvious, but what’s less obvious is that taxonomies always lag behind their domain. There are always new things happening out there, on the edge: emerging trends and new areas that are not yet in your taxonomy simply because they’re too recent. So if you restrict yourself to taxonomy-based annotation, you’ll never identify these new topics when they occur in text.
Furthermore, some concepts just don’t lend themselves to taxonomy-based indexing approaches. Try identifying people or company names with a taxonomy. It doesn’t work: new names appear every day, and no list can keep up. Try identifying an acquisition event and the role each company plays in it (acquirer or target?) with a taxonomy. It doesn’t work either. That’s why over the past ten years we’ve extended the range of techniques in our platform well beyond taxonomies, to include part-of-speech tagging, syntactical reasoning, learning-based categorization, content clustering, and machine learning, so our customers can combine these approaches and go way beyond the basics offered by taxonomies.
So to summarize, taxonomies are a great first step into semantic content enrichment, especially with the help of STF Skill Cartridges®, but to go further, much more powerful techniques are needed. The good news is that the Luxid® Content Enrichment Platform embeds a broad range of information extraction techniques and comes with the full set of tools you need to leverage them.