Global businesses speak many languages. Conducting business in multiple languages creates both opportunities and risks, because some of the most egregious compliance issues become much harder to detect when employees write in a language other than English. If you do not want to miss critical documents during internal investigations, it is important to get the multilingual keyword searches right. A literal translation of keywords is often ineffective. Here are seven multilingual keyword search challenges that are particularly surprising.
1. How to spell foreign first names and place names in English
When speakers of languages with a non-Latin script (Asian languages, Russian, Hebrew, Greek, etc.) write in English, they might render the place names and first names of their native tongue in different ways. When searching documents, we therefore need to be conscious of the different ways to spell names where the original language uses a non-Latin script.
As an example, the common Greek first name Thalia (“Θάλεια” in Greek) could also be spelled “Thaleia,” and Greek employees writing in English might use both spellings. We would want to reflect those different spellings in our keyword searches.
2. Chat language and diacritics
There are several reasons why, especially in informal communications, speakers of languages with a non-Latin script may in fact use Latin script to communicate. Maybe their native language keyboard is not installed at work; maybe it is quicker to type without switching between different scripts. Whatever the reason, conventions have formed in many languages to render non-Latin letters using Latin script.
As an example, the Greek letter “θ” could be rendered as “th” (because that is how you pronounce it) or as the number “8” (because the “8” looks a little bit like the Greek letter). So if we want to detect the Greek phrase “I will be,” we will need to search for the string “tha eimai” as well as the informal “8a eimai.”
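The expansion described above can be sketched in a few lines of Python. This is only an illustration: the function name `expand_variants`, the one-entry mapping table and the quoted `OR` query syntax are assumptions, and a real mapping would need to be built with a native speaker.

```python
from itertools import product

# Latin-script renderings of a Greek letter, taken from the example above.
# A production table would cover many more letters (assumption).
RENDERINGS = {"θ": ["th", "8"]}

def expand_variants(phrase: str) -> list[str]:
    """Return every Latin-script spelling of a phrase.

    The phrase is written in Latin script except for the Greek
    letters listed in RENDERINGS, which are expanded in all ways.
    """
    options = [RENDERINGS.get(ch, [ch]) for ch in phrase]
    return ["".join(combo) for combo in product(*options)]

variants = expand_variants("θa eimai")

# Join the variants into one Boolean query; the exact syntax depends on
# the review platform and is assumed here.
query = " OR ".join(f'"{v}"' for v in variants)
print(query)  # "tha eimai" OR "8a eimai"
```

Because every combination is generated, a phrase containing several such letters expands quickly, so the resulting list is worth reviewing before it is run.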
3. Transliteration of English words into another language
Languages using non-Latin script need to find ways to spell “imported” words for which there is no local equivalent. Take Japanese as an example. The katakana script, one of the three Japanese scripts alongside hiragana and kanji, is used to spell words of non-Japanese origin, and some English words can be spelled in more than one way. If, for instance, you want to search your Japanese employees’ documents for the keyword “vendor,” you would need to search for the string ベンダー (the letters form the sounds “benda”) as well as the alternative spelling ヴェンダー (a newer spelling with a “vee” sound at the beginning).
On the other hand, not every English word will be transliterated by those who speak the target language of the investigation. Some first names or place names may be left in their more common Latin-script form, especially if a translation is not in common use. If you were crafting keyword searches to identify whether a meeting took place in London’s beloved Hyde Park, you might simply look for the expression “Hyde Park,” even when searching foreign language documents, because the place name either was not translated or does not have a translation.
4. Creative nicknames
The creativity that goes into nicknames is boundless, and when we search for a first name in another language we reflect nicknames that are in common usage: “Jon OR Jonathan,” “Tim OR Timothy,” etc.
Some languages take the range of nicknames to extremes. In Russian, for instance, Dima, Dimka, Dimochka, Dimulia, Dimych and six other variations are all common nicknames for the name Dmitry. Knowing that this is the case, we would include those nicknames in our search.
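As a rough sketch, a nickname list like this can be kept in a lookup table and turned into an “OR” clause automatically. The table and the function name `name_query` are hypothetical; only the nicknames named above are included.

```python
# Nicknames named in the text. The "six other variations" it mentions are
# not listed there, so they are left out here rather than guessed.
NICKNAMES = {
    "Dmitry": ["Dima", "Dimka", "Dimochka", "Dimulia", "Dimych"],
    "Jonathan": ["Jon"],
    "Timothy": ["Tim"],
}

def name_query(formal_name: str) -> str:
    """Join a formal name and its known nicknames into one OR clause."""
    terms = [formal_name] + NICKNAMES.get(formal_name, [])
    return " OR ".join(terms)

print(name_query("Dmitry"))
# Dmitry OR Dima OR Dimka OR Dimochka OR Dimulia OR Dimych
```

A name with no entry in the table simply comes back unchanged, so the same function can be applied to every custodian name in a matter.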
5. Lots of grammar: Verbs
One of the things that makes English documents easy to search is that English grammar is simple. You can search for the term “conceal*” and catch “I conceal” as well as the expressions “they will conceal” and “you had concealed” and so on. This simple stemming-type search is perfectly serviceable in English but would fail even in some mainstream European languages like Italian, where a search for “cela*” (the equivalent wildcard search for “conceal”) would miss “io celo” (I conceal), “noi celiamo” (we conceal) and many other grammatical inflections.
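The difference can be reproduced with a small Python sketch that treats a trailing asterisk as “any word ending.” The helper `wildcard_to_regex` is a hypothetical stand-in for a review platform's wildcard operator, which may behave differently, and the phrase “il cellulare” (the mobile phone) is an invented test case.

```python
import re

def wildcard_to_regex(term: str) -> re.Pattern:
    """Translate a trailing-asterisk wildcard such as 'cela*' into a regex."""
    stem = re.escape(term.rstrip("*"))
    return re.compile(rf"\b{stem}\w*", re.IGNORECASE)

english = wildcard_to_regex("conceal*")
italian = wildcard_to_regex("cela*")

english_phrases = ["I conceal", "they will conceal", "you had concealed"]
italian_phrases = ["io celo", "noi celiamo"]

english_hits = [p for p in english_phrases if english.search(p)]
italian_hits = [p for p in italian_phrases if italian.search(p)]

print(len(english_hits))  # 3: every English phrase is caught
print(len(italian_hits))  # 0: both Italian inflections are missed

# Shortening the stem to "cel*" catches both verb forms, but it also
# matches unrelated words, so it trades missed documents for noise.
broader = wildcard_to_regex("cel*")
broader_hits = [p for p in italian_phrases + ["il cellulare"] if broader.search(p)]
print(len(broader_hits))  # 3: includes the unrelated "il cellulare"
```

In practice the safer route is usually to list the inflected forms explicitly with a native speaker rather than to keep shortening the stem.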
6. Even more grammar: Nouns and adjectives
In English, the spelling of a noun only changes to reflect the grammatical number (one garden, two gardens) and not the grammatical case (nominative, accusative, dative, etc., depending on the target language). When you search foreign language documents, be aware that the word “bribe” might be spelled differently in the target language depending on whether you are searching for “a bribe was accepted” or “I accepted a bribe.”
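To make this concrete, the sketch below assumes Russian as the target language, which is a choice made for illustration and not taken from the text. The Russian noun for “bribe” is “взятка” in the nominative and “взятку” in the accusative; the two sample sentences are invented, and the other case forms are omitted for brevity.

```python
import re

# Two singular case forms of the Russian noun for "bribe" (assumed example).
BRIBE_FORMS = {
    "nominative": "взятка",  # "a bribe was accepted"
    "accusative": "взятку",  # "I accepted a bribe"
}

# Invented sample sentences, one per construction.
documents = [
    "Взятка была принята",  # A bribe was accepted
    "Я принял взятку",      # I accepted a bribe
]

# A literal translation searches for the dictionary (nominative) form only.
literal = re.compile(r"\bвзятка\b", re.IGNORECASE)

# A case-aware search lists every form explicitly.
all_forms = re.compile(
    r"\b(?:" + "|".join(BRIBE_FORMS.values()) + r")\b", re.IGNORECASE
)

literal_hits = [d for d in documents if literal.search(d)]
case_aware_hits = [d for d in documents if all_forms.search(d)]

print(len(literal_hits))     # 1: the accusative sentence is missed
print(len(case_aware_hits))  # 2
```

A complete search would also need the remaining singular and plural case forms, which a native-speaking consultant can supply.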
7. Synonyms and local culture
Most lawyers investigating data would intuitively use synonyms where appropriate. When you ask your search term consultant to find suitable synonyms, they will also be aware of similar concepts in the target language which may not technically be synonyms. For example, in several Southeast Asian countries, Chinese mooncake is given as a gift as part of corporate hospitality. The translation of the word “mooncake” should therefore be added to the search terms if one were investigating, say, a Singaporean custodian, since “mooncake” could be a euphemism for an inappropriate gift.
To minimize the risk of regulatory sanctions, global businesses need a plan for how to search and review documents in different languages. If your search methodology is flawed, then you risk missing critically important documents, missing documents that the investigating regulator found and drawing the wrong conclusions from your review exercise. Culturally aware multilingual consultants can help minimize these risks for you and help you iteratively refine keyword searches until you find exactly what you need to keep your business or your client’s business safe.
Originally published in LegalTechNews