Callahan, Ewa

Internet Search Engines: Information Retrieval in Less Common Languages

Abstract

The number of non-English-speaking Internet users is growing, and various search engines try to accommodate them by providing regional versions of their services. Moreover, many countries develop language-specific search tools. Little is known, however, about how well search engines support information retrieval in foreign languages.

In this study we evaluate global and regional search engines in an attempt to find the best methods for foreign-language searches. We have chosen a set of less common languages that use the Latin alphabet with diacritical marks, namely Czech, Hungarian, and Polish. Although the data analyses are limited to this set of languages, we think that the solutions may be applicable (with caution) to other languages that use a similar alphabet.

This research addresses the following questions:
Q1: How well do regional and regionalized search engines support queries in a foreign language?
Q2: Which search strategy is most helpful for locating information in a foreign language?

Data were collected using four major and two regional search engines for each country. The keywords were entered in two versions: without diacritical marks and with the diacritical marks encoded. The first one hundred hits from each search were examined. The searches were compared on the basis of the overlap of retrieved documents. Differences in results from searches with and without diacritical marks were also noted. Our analysis also included truncation strategies for morphological cases.
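The comparison described above can be illustrated with a minimal Python sketch. It is not the procedure used in the study: the function names, the `top_n` parameter, and the use of a Jaccard ratio as the overlap measure are all assumptions made for illustration, and the URLs in the usage example are hypothetical. The sketch derives the mark-free version of a keyword and counts the documents shared by two result lists.

```python
import unicodedata

# Assumption: the Polish letters ł/Ł have no canonical Unicode decomposition,
# so they are mapped by hand before normalization.
_NO_DECOMPOSITION = str.maketrans({"ł": "l", "Ł": "L"})


def strip_diacritics(text: str) -> str:
    """Return the keyword with diacritical marks removed."""
    decomposed = unicodedata.normalize("NFD", text.translate(_NO_DECOMPOSITION))
    # Drop the combining marks that NFD split off from their base letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def overlap(hits_a, hits_b, top_n=100):
    """Compare the first top_n hits of two searches.

    Returns (number of shared documents, Jaccard ratio). The Jaccard ratio
    is an assumed measure; the study does not define its overlap formula.
    """
    a, b = set(hits_a[:top_n]), set(hits_b[:top_n])
    shared, union = a & b, a | b
    return len(shared), (len(shared) / len(union) if union else 0.0)


if __name__ == "__main__":
    print(strip_diacritics("żółć"))    # Polish
    print(strip_diacritics("Győr"))    # Hungarian
    print(strip_diacritics("příliš"))  # Czech
    # Hypothetical result lists from a major and a regional engine.
    major = ["http://example.pl/a", "http://example.com/b", "http://example.pl/c"]
    regional = ["http://example.pl/c", "http://example.pl/d", "http://example.pl/a"]
    print(overlap(major, regional))
```

Comparing hits as exact URL strings is the simplest choice; a fuller analysis would likely normalize URLs (trailing slashes, letter case in host names) before counting shared documents.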

Our preliminary results suggest that the overlap between major and regional search engines is relatively small. Regional search engines cover primarily country domains, so the expected location of the site being sought should determine the choice of engine. Some overlap was noted between searches done with and without diacritical marks. Our preliminary hypothesis for this pattern is that words without marks were also included in titles, meta tags, and URLs, although further examination is still needed.