In a previous column,* I pointed out how much of a difference a keyword search can make. I would now like to draw your attention to everyday elements of language like “insignificant” particles, which can help you understand search engines, as well as the search function integrated into tools for language professionals.
Experienced TERMIUM Plus® users are well aware of this concept. They know there’s no point in using noise words—those small, omnipresent words like articles and prepositions or, if you’ve learned the new grammar, determiners—in their searches. These words are characterized as noise because they have little meaning and are mostly used to link words in a sentence.
Therefore, in TERMIUM Plus®, regardless of whether you enter “gouvernement au Canada” or “gouvernement du Canada” in the “French Terms” search field, you’ll be taken to the same record. If that surprises you, then it may also surprise you that the Government of Canada’s terminology and linguistic data bank is not the only one that works this way—in fact, far from it!
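To make this concrete, here’s a minimal sketch in Python of how a search function might drop noise words before comparing queries. The stop list is a tiny hand-picked sample, and none of this reflects TERMIUM Plus®’s actual implementation:

```python
# A minimal sketch of noise-word (stop-word) filtering, assuming a tiny
# hand-picked stop list; not TERMIUM Plus(R)'s actual implementation.
FRENCH_STOP_WORDS = {"au", "du", "de", "le", "la", "les", "un", "une", "des"}

def normalize(query: str) -> tuple:
    """Lower-case the query and drop noise words before comparing."""
    return tuple(w for w in query.lower().split() if w not in FRENCH_STOP_WORDS)

# Both queries reduce to the same key, so both lead to the same record.
assert normalize("gouvernement au Canada") == normalize("gouvernement du Canada")
print(normalize("gouvernement du Canada"))  # ('gouvernement', 'canada')
```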
Ignoring these words is done on purpose, but why? Simply because indexing extremely common words significantly slows down most indexes.
In extreme cases, searching for word combinations or expressions such as “one of the” or “oui mais” could easily take a hundred or even a thousand times longer than searching for an expression made up of two “significant” words.
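Why such a gap? Most search engines rely on an inverted index, in which each word points to the list of documents containing it, and a multi-word search intersects those lists. The sketch below uses invented list sizes to show how much more work noise words create:

```python
# A toy inverted index: each word maps to the set of document IDs that
# contain it. The list sizes below are illustrative, not measurements.
from functools import reduce

index = {
    "one": set(range(100_000)),         # noise words appear almost everywhere
    "of": set(range(100_000)),
    "the": set(range(100_000)),
    "gouvernement": set(range(2_000)),  # "significant" words are far rarer
    "terminologie": set(range(150)),
}

def candidates(words):
    """Intersect the posting lists; the work grows with each list's size."""
    return reduce(set.intersection, (index[w] for w in words))

# Intersecting three huge lists touches hundreds of thousands of postings,
# while two short lists touch a few thousand: hence the hundred- or
# thousand-fold difference in search time.
print(len(candidates(["one", "of", "the"])))              # 100000
print(len(candidates(["gouvernement", "terminologie"])))  # 150
```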
Machines and software are more powerful than ever. As a result, designers of new products often choose to limit indexes initially and then expand them gradually to include numbers and noise words, depending on the means available.
We know that Google indexes billions of documents in English. So, for fun, let’s conduct a search of more than just two or three words in the world’s most popular search engine.
Allow me to state the obvious: the longer the sentence, the more uncommon it is, even in a gigantic corpus. Does this hold only for 100‑word sentences, or also for 20‑word or even 15‑word sentences? Let’s see just how true this is.
Let’s perform an exact search on part of a question often asked on Google: “Why doesn’t she love.”
We should get nearly 1.5 million hits. When we add the word “me,” we should get roughly half the number of hits (876,000). Now let’s add the word “anymore.” The number of hits drops to 125,000, even though our sentence is extremely common. Now let’s add “like.” We’re left with a measly 913 hits.
The most fascinating part is that most of the 913 hits are found in the sentence “Why doesn’t she love me anymore like I love her?”
Just luck perhaps? Well, let’s try “The history of Canada” instead, then add “is,” then “not” and then “quite.” We get a few thousand hits, most of which seem to appear in the sentence “The history of Canada is not quite as explosive.”
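The pattern is easy to reproduce on any collection of texts. In the toy demonstration below, the documents and counts are invented (they are not Google’s figures), but the hits fall away in the same way as the exact phrase grows:

```python
# Toy demonstration: the longer the exact phrase, the fewer the matches.
# The corpus and the resulting counts are made up for illustration.
corpus = [
    "why doesn't she love him",
    "why doesn't she love me",
    "why doesn't she love me anymore",
    "why doesn't she love me anymore like i love her",
    "i wonder why doesn't she love me anymore like i love her",
]

def exact_hits(phrase: str) -> int:
    """Count the documents containing the exact phrase."""
    return sum(phrase in doc for doc in corpus)

for phrase in ("why doesn't she love",
               "why doesn't she love me",
               "why doesn't she love me anymore",
               "why doesn't she love me anymore like"):
    print(f'{exact_hits(phrase):2d} hits  "{phrase}"')  # 5, 4, 3, 2 hits
```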
So what does that prove? Just that even in a huge corpus, a sentence beyond a certain length will generally be found in identical or very similar contexts.
In short, querying longer sentences will probably replace many of the complex document classification mechanisms that preoccupy so many semantic Web researchers.
In addition to using the right keywords, make your exact searches longer, shortening them only if you find nothing.
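In code, that advice might look like the following sketch, where a one-document corpus stands in for a real search engine: the full phrase is tried first, and the last word is dropped after every miss:

```python
# A sketch of "search long, shorten on a miss"; the corpus and the
# search() helper are illustrative stand-ins for a real engine.
corpus = ["the history of canada is not quite as explosive as some believe"]

def search(phrase: str) -> list[str]:
    """Exact-phrase search over the toy corpus."""
    return [doc for doc in corpus if phrase in doc]

def longest_match(query: str, minimum: int = 2) -> list[str]:
    """Try the full query first, then drop the last word until hits appear."""
    words = query.lower().split()
    while len(words) >= minimum:
        hits = search(" ".join(words))
        if hits:
            return hits
        words.pop()  # no hits: shorten the exact phrase and retry
    return []

# Fails at full length, then succeeds on "the history of canada is".
print(longest_match("The history of Canada is definitely not boring"))
```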
Obviously, if a search engine allows for a cascading search, you’ll get even better results.
A cascading search is a search conducted in a target set of records (e.g. a particular corpus), but if the initial search criteria produce no result, the target set is broadened according to the user’s preferences.
I think that the Translation Bureau will probably want to apply this logic to the tools for its language professionals. For example, users of a shared terminology tool could search first in their own records, then in their own team’s records, then in those of other teams working in similar fields, and then as a last resort in the full database of records.
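Purely as a hypothetical sketch, such a cascading search could look like this, with each tier of records tried in turn and the search stopping at the first tier that yields hits:

```python
# A hypothetical cascading search: tier names and records are invented.
def cascade_search(term, tiers):
    """Search each tier in order; stop at the first tier with hits."""
    for name, records in tiers:
        hits = [r for r in records if term in r]
        if hits:
            return name, hits
    return "no tier", []

tiers = [
    ("my records",        ["contrat de service"]),
    ("my team's records", ["niveau de service", "entente de service"]),
    ("related teams",     ["accord sur les niveaux de service"]),
    ("full database",     ["service level agreement; entente de niveau de service"]),
]

print(cascade_search("niveau de service", tiers))
# -> ("my team's records", ["niveau de service"])
```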
All too often, bilingual concordancers and bitext-based translation memories report their results in terms of the number of matches found (leaving aside no-hit searches). Their designers assume that everything found will be used, and count every match as a saving. Sometimes they even go so far as to calculate the dollar value of the time saved, often without using any realistic measure.
Replacing an old way of searching with a faster one yields savings if—and only if—the results are usable. Anyone whose calculations do not take into account the fact that approximately 20% to 25% of all successful searches will not be used is wearing rose-coloured glasses. What’s more, from these savings must be subtracted the unsuccessful searches that would have been faster if done otherwise. Personally, I always prefer calculations that allow for a substantial margin of unused hits.
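For illustration, here’s what such a cautious calculation might look like; every figure is an invented assumption, not measured data:

```python
# A cautious savings estimate: all figures below are invented assumptions.
searches           = 1_000   # searches over some period
hit_rate           = 0.70    # fraction of searches that find a match
unused_rate        = 0.25    # 20% to 25% of successful searches go unused
saved_per_used_hit = 90      # seconds saved by actually reusing a match
lost_per_miss      = 20      # extra seconds spent on a fruitless search

used_hits   = searches * hit_rate * (1 - unused_rate)  # 525 usable matches
misses      = searches * (1 - hit_rate)                # 300 wasted searches
net_seconds = used_hits * saved_per_used_hit - misses * lost_per_miss

print(f"net time saved: {net_seconds / 3600:.1f} hours")  # about 11.5 hours
# Counting all 700 hits as savings would overstate the figure by a third.
```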
* See “My quest for information in 2010,” Language Update, vol. 7, no. 2 (June 2010).