Recognising Words – What are the problems?

In my previous blog, I gave a bit of a sketch of how to handle the problem of having a computer look at two pieces of translated text and figure out which words/phrases are the translations of which other words/phrases – without teaching the computer anything about the two languages (i.e. no dictionaries).

While the idea seemed like it might work, I think there are some…

Potential Problems To Solve

I am going to try and address them (and probably a lot more that I haven’t yet thought of) in upcoming blogs.

  • Most real sentences have a lot more than two words in them. This is going to lead to a HUGE number of potential combinations – that all need to be stored and processed.
  • In the example cases I gave in the previous blog, I was just assuming one word in the source language would map to one (or more) words in the target language. While that may be more or less true when going from a language like Turkish (which uses lots of endings on words) to a language like English (which tends to add lots of extra little words), it’s not universally true. And it’s especially not true when going from English back to Turkish, for obvious reasons! So, the problem is more about trying to match phrases that translate other phrases. While conceptually that is not a big jump, it will radically increase the number of combinations the computer needs to sift through.
  • How do we handle repeats of the same word in the same sentence, and how do we know which bit translates what? E.g. “Kırmızı araba benim sevdiğim araba” or “The car that I loved was the red car”. The Turkish has two cars (araba), and due to the way Turkish works, they are not in the same order in the English. This is an ordering problem.
  • What happens when words in our learning text don’t occur very often?
  • What happens when a word has multiple meanings and is translated in different ways? How well does my idea handle context?

There are also a few other questions that spring to mind.

Questions to Answer

  • How can I use human input to improve the matching? In other words, I could get a human to check the work of the computer after it has made some initial guesses. It would be good if a human could tell the system what it did right (so it was more likely to do it again) and conversely, what it did wrong.
  • How well will this system work with idioms? Most idioms are meaningless when translated literally – and of course, never are by decent human translators.
  • Technically, what programming language(s) should I use?
  • What kind of data structures will I need to store these phrase combinations in?
  • What sort of Turkish –> English text could I use to train my system? How much text do I need? How do I handle the different stylistic choices of different human translators?
  • Has anyone else used this sort of approach before (I am pretty sure someone will have at least tried it) and what can I learn from them?
  • Given we can construct some kind of data structure storing this stuff, how do we use it to actually deduce and then guess which words translate which?

So, that’s quite a few issues. Let me start to address the first, that of exploding numbers, now.

Handling many Combinations

According to this page, English writers should aim for 15–20 words per sentence. Since Turkish doesn’t use many prepositions and opts for endings on words instead, its sentences should have fewer words. According to this page (written in Turkish, sorry), in a sample of 500,000 words, 44% had no endings on them – all the others had one or more endings, and just shy of 48% had one or two endings. This article says that the average length of sentences written by Turkish 13-year-olds is about 8 words. Presumably educated adult writers will write somewhat longer sentences. Of course, the genre of writing will have quite an impact on sentence length too.

Regardless, if a 13-word Turkish sentence results in a 20-word English sentence, we are looking at a huge number of possible combinations.

And for all the computer knows, each of the thirteen words in the Turkish sentence could (possibly) be translated by any combination of the 20 English words.

How many? I figured out that for a sentence with n words in it, there are 2^n − 1 possible combinations. So, assuming we have a Turkish sentence that is translated into the three-word sentence “ants bite cats”, there are these 7 (2^3 − 1) possible combinations:

  1. ants
  2. bite
  3. cats
  4. ants bite
  5. ants cats
  6. bite cats
  7. ants bite cats

And from the previous post, I said that each of the Turkish words could potentially mean any one of these English combinations. For our 20-word sentence, that makes 2^20 − 1 (or 1,048,575) possible combinations that we might have to store! That’s a lot – and that’s just one sentence of the potentially thousands we are going to have to process. The good news is that I think we can cut these combos down.
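Enumerating these combinations is straightforward to sketch in code. Here is a minimal Python version (Python is just an illustrative choice – the post hasn’t settled on a language yet) that produces every non-empty, order-preserving subset of a sentence’s words, matching the seven combinations listed above:

```python
from itertools import combinations

def word_combos(sentence):
    """Every non-empty subset of the sentence's words, original order kept."""
    words = sentence.split()
    return [" ".join(combo)
            for r in range(1, len(words) + 1)
            for combo in combinations(words, r)]

word_combos("ants bite cats")
# 7 combinations: 'ants', 'bite', 'cats', 'ants bite',
# 'ants cats', 'bite cats', 'ants bite cats'
```

Note that non-contiguous subsets like “ants cats” are included, which is why the count is 2^n − 1 rather than the much smaller number of contiguous phrases.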

Let’s look at another Turkish sentence, this one with three words in it – “ben okula giderdim.” In English, this means “I used to go to school”. “Ben” means “I”, “okula” means “to school” and “giderdim” means “I used to go”.

Ben okula giderdim (3 words) –> I used to go to school (6 words)

Six words give us 63 combinations to store. But do we really need all 63?

Since every word in the source language maps to at least one word in the target language, one word in the source can’t map to all of the other words in the target – since that would leave no words in the target for the other source words to map to.

In this case, “giderdim” can’t mean “I used to go to school” – since that doesn’t leave any English words over to be the translation of “ben” and “okula”. Therefore, the longest possible word combination to translate “giderdim” in this case is 4 words, leaving 2 over for the other two words. This means we can toss the one 6-word combo and the six 5-word combos, leaving us 63 − 7 = 56 combos. It’s only a slight improvement, but the closer the sentences get to each other in word length, the fewer combinations there are. This is good news.
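This pruning rule is easy to count with the binomial coefficient. A small Python sketch (assuming, as the post does, that each source word maps to at least one target word):

```python
from math import comb

def pruned_combo_count(source_len, target_len):
    """Count the target-word subsets that survive the length cap:
    one source word can claim at most
    target_len - (source_len - 1) target words, since every other
    source word still needs at least one target word."""
    max_len = target_len - (source_len - 1)
    return sum(comb(target_len, r) for r in range(1, max_len + 1))

pruned_combo_count(3, 6)  # 56 combos instead of the full 63
```

As the sentences approach the same length, `max_len` shrinks towards 1 and the count collapses – e.g. equal 6-word sentences leave only the 6 single-word combos.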

What happens when the source sentence actually has MORE words than the target? For example, what if we were going from English (6 words) to Turkish (3 words)?

I used to go to school –> Ben okula giderdim

In this case, the worst case scenario is if one English word (say “I”) is translated by all but one of the Turkish words (say “Ben okula”), leaving all of the remaining English words (“used to go to school”) to be translated by the one remaining Turkish word (“giderdim”). Now that isn’t actually true in this case – but the computer doesn’t know that. This means that we need to store all the one- and two-word Turkish combinations. That cuts down the number of Turkish combinations we have to store, which is also good news.

There is one more way to cut down the combos that springs to mind. Using our knowledge of human languages, we can say that it’s not often that more than 5 words are needed to translate a single word. This would limit us to storing one- to five-word combos of the target language for every word in the source sentence. Sure, if there are 20 words in the sentence, there are still a lot of one- to five-word combinations (and I’d need to crack out the combinations and permutations formulae to figure out the exact number) but I’m pretty sure it’s going to be significantly less than a million.
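The exact number those formulae would give can be computed directly. A quick Python check of the five-word cap against the full subset count:

```python
from math import comb

# All 1- to 5-word subsets of a 20-word English sentence,
# versus all 2^20 - 1 possible non-empty subsets.
capped = sum(comb(20, r) for r in range(1, 6))
total = 2**20 - 1
print(capped, total)  # 21699 1048575
```

So the cap brings the per-word storage for a 20-word sentence down from roughly a million combinations to about twenty-two thousand – a reduction of around 98%.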

So far we are only dealing with the slightly artificial case of one word in the source translating one or more words in the target language. But in real languages, it’s much more normal for one phrase (of one or more words) to be translated into another phrase (of one or more words) in the target. In other words, I think we are going to have to store combos for both the source and target languages. In my next post I will try and figure out the implications of that.

3 Replies to “Recognising Words – What are the problems?”

  1. Fun to read your thoughts Charles. I often find it more enjoyable (and natural) to think about a problem from first principles and using my own imagination than to scour the Internet for ways that other people have solved a problem, and you seem to be enjoying the same process here.

    Here are some random thoughts on your topic of thought:

    – Solving this problem can be thought of somewhat like a puzzle where you can use what you have learned to eliminate possibilities. More basic learnings may enable you to eliminate enough possibilities to learn medium patterns, and learning medium patterns may allow you to learn advanced patterns.

    – Before trying to find mappings that involve multiple words, one can start trying to find one-word to one-word mappings. This is a dramatically easier problem, but solving it can be used to eliminate many possibilities that will make it easier to later figure out mappings that involve multiple words.

    – Mappings involving single words: Let’s take a word like “car” and run an experiment to see if we can find good evidence that it often maps to a single word in the Turkish translation. We’ll take all of the English sentences involving “car”, and for each one, we’ll take the Turkish translation. For each word in Turkish, we’ll compute the probability that it shows up in the Turkish translation of a sentence involving the English word “car”. This gives us a probability distribution, and the most common Turkish translation of “car” is likely to be either the Turkish word with the highest probability in that distribution or, failing that, one of the most likely words. Unfortunately, because some words like determiners and prepositions are so super common, there’s also a chance that one or more of those words will be the most common.

    – So how can we try to solve that problem? Let’s back up a step and first solve a different basic problem so that it’s easier to find translations for words like “car”. The problem we’ll solve now is to try to get a sense of which words in a language are the super common grammatical words. I.e. they’re not in the long tail of nouns, and they’re not in the long tail of verbs, but are instead words like “the”, “a”, “in”, etc. The easiest way to do that is to count how often they occur. Quite simply, the words that occur the most often are a good approximation for these words.

    – Now that we have our set of “common words”, we can re-do our analysis of “car”, eliminating from consideration the common words, even if they appear in lots of translations of sentences involving “car”. Now we really have a pretty good chance of discovering the most common Turkish translations of “car”.
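Daniel’s two steps – approximate the “common words” by raw frequency, then count co-occurrence for the word of interest – could be sketched roughly like this in Python. The function name, the `n_common` cut-off, and the corpus format (a list of (English, Turkish) sentence pairs) are all illustrative assumptions, not anything from the post:

```python
from collections import Counter

def likely_translations(word, pairs, n_common=20):
    """Rank Turkish words by how often they co-occur with an English
    word, after discarding the most frequent ("grammatical") Turkish
    words. `pairs` is a list of (english_sentence, turkish_sentence)
    string tuples."""
    # Step 1: approximate the super-common words by raw frequency.
    freq = Counter(w for _, tr in pairs for w in tr.split())
    common = {w for w, _ in freq.most_common(n_common)}
    # Step 2: count co-occurrence with the English word of interest.
    relevant = [tr for en, tr in pairs if word in en.split()]
    hits = Counter()
    for tr in relevant:
        for w in set(tr.split()):  # count each word once per sentence
            if w not in common:
                hits[w] += 1
    # P(Turkish word appears | English sentence contains `word`)
    return {w: count / len(relevant) for w, count in hits.items()}
```

One wrinkle this sketch immediately exposes for Turkish: because of all those endings, “car” will co-occur with several surface forms (araba, arabam, arabamı, …), spreading the probability mass across them.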

    – We can do our analysis of all of the English words now, and for each analysis we do, we can assign a number reflecting how happy we are with our hypothesis translation(s). Some words will give a very strong signal that we’ve nailed it. For other words, we’ll be less convinced.

    – Because we don’t want to lead ourselves astray, we may choose a cut-off so that we only add single word translations to our gold set if we’re 99% sure (ish) that we’ve discovered a good translation.

    – At this point, we’ve achieved something: We have a non-trivial set of English words for which we have a strong confidence that we know the Turkish translation. As mentioned at the top of my response, we can continue upward using these “basic patterns” to help us learn the “medium patterns”.

    – As an example, if we now go back to your original problem where you were considering potentially a million possibilities, we can look at your sentence and if we see words in our “gold” set where our suspected translation does indeed occur in the translated sentence, we can propose that, most likely, those words correspond. If we can do that for more than one word in a given sentence, the number of possibilities will drop very fast.
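That final pruning idea – using the high-confidence “gold” set to shrink the alignment problem before enumerating combinations – might look something like this sketch. The gold dictionary, the helper name, and the example sentences are all hypothetical:

```python
def prune_with_gold(source_words, target_words, gold):
    """Remove word pairs already aligned via a high-confidence `gold`
    dictionary, leaving a smaller alignment problem behind."""
    aligned_targets = {gold.get(s) for s in source_words}
    remaining_src = [w for w in source_words
                     if gold.get(w) not in target_words]
    remaining_tgt = [w for w in target_words
                     if w not in aligned_targets]
    return remaining_src, remaining_tgt

# Hypothetical gold entries discovered by the earlier analysis:
gold = {"car": "araba", "red": "kırmızı"}
src = ["the", "red", "car", "is", "mine"]
tgt = ["kırmızı", "araba", "benim"]
prune_with_gold(src, tgt, gold)
# remaining: (['the', 'is', 'mine'], ['benim'])
```

Each gold match removes a word from both sides, so the 2^n-style combination counts shrink exponentially with every match.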

    1. Absolutely, I was trying to do this from a first-principles approach – although the further down the rabbit hole I got, the trickier the problem seemed to become! However, you have some great suggestions here, Daniel. I think I have actually moved away from the approach I am outlining in this post – just because even though I can figure out ways of cutting down the probabilities, I still don’t think it’s enough. But the approach you are outlining is helpful, and will certainly work where, in the given text, one word maps to just one other meaning – although just scanning through a dictionary (in likely any language) will tell you that many words really have many (albeit often related) meanings. It may well be that in a text covering a specific topic, most words will only have one meaning – that would be the hope. I do like the “low-hanging fruit” approach. Pulling out all the easy translations should make it incrementally easier every time you run through the data-set. Quite what percentage of words and phrases you can match before it all starts getting very woolly could likely only be determined experimentally – and will vary for each text. For now, I think I am putting this problem on hold until we cover unsupervised learning models with Coursera!

      1. Daniel, I might also add in response to your comments about prepositions and other ‘noise’ words: Yep – good point, although in some languages (especially Turkish) there are fewer of those noise words. There simply are no words in Turkish corresponding to ‘the’, ‘is’, ‘to’, ‘from’, ‘on’ etc. These sorts of things are almost always done with endings on words. So ‘the car’ in English (two words) in Turkish is ‘araba’ (one word), and ‘in your car’ is ‘arabanda’. Now, Turkish does have some ‘noise’ words – but lots of them don’t really have simple analogues in English (e.g. ‘var’, ‘yok’). But I think your point about eliminating some of these super common words still stands, and fits my desire to find a solution that doesn’t require specialist linguistic knowledge of each language I am working with.
