Recognising Words (Part 1)

I have a part-time web development job. One page of this site shows text in two languages side by side. We want to be able to mouse-over a word in the source language on the left and have the translated word (or words) highlight in the other column. The problem is that there is a LOT of text, and having a human mark up in a database that word x in language A (which I don’t know) means phrase y in language B is going to take a very long time.

Could this be a use case for machine learning?

Let’s say language A, the source language, is Turkish. In my work case it isn’t – but I happen to know Turkish and can give examples in it, whereas I don’t know the actual source language. And let’s say the target language B is English.

So, for every Turkish sentence, we know what its English translation is. What we don’t know is what English words each Turkish word in a given sentence maps to. Because we don’t know these mappings ahead of time, according to the terminology I learned in my first week at Coursera, this sounds like an unsupervised machine learning problem.

My aim is to get a computer to analyse every sentence in the Turkish text, and figure out what it has been translated to in English – but to do this without teaching the computer anything about English or Turkish.

The Basic Idea

The day after starting at Coursera, I had an idea. The basic premise is that in our data, we should be able to see repeating words in the source language with corresponding repeating patterns of translated words in the target language. By counting the number of times combinations of words in the source text match combinations of words in the target text, we should be able to spot the patterns and match words to their translations.

This means that for each Turkish word in a sentence, we can know there is some probability that it maps to one or more words in the English sentence. As we scan each Turkish sentence and compare it to its English translation, we should be able to start building up probabilities that individual Turkish words match specific English words or phrases, based on how often they appear together across the sentences and translations in our training set.
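To make that counting idea concrete, here is a minimal sketch in Python. The three sentence pairs are the toy examples from this post, not the site’s real data, and the simple pair-counting scheme is just my own illustration of the principle:

```python
from collections import Counter

# A toy parallel corpus (illustrative only -- not the site's real data).
pairs = [
    ("araba hızlıdır", "the car is fast"),
    ("araba kırmızıdır", "the car is red"),
    ("elma kırmızıdır", "the apple is red"),
]

# Count how often each (Turkish word, English word) pair appears
# together in the same sentence pair.
cooc = Counter()
for turkish, english in pairs:
    for t in turkish.split():
        for e in english.split():
            cooc[(t, e)] += 1

# "araba" co-occurs with "car" in two sentences, but with "red" in
# only one -- exactly the kind of signal the counting relies on.
```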

Unpacking the Idea

So, what could that idea mean in practice? Let’s look at some example sentences:

Sentence #   Turkish            English
1            Araba hızlıdır     The car is fast
2            Araba kırmızıdır   The car is red
3            Elma kırmızıdır    The apple is red
Just so you know, in these sentences “araba” means “the car” and “hızlıdır” is a formal way of saying “is fast”. As a human, you can probably now figure out what all the other words mean. But the computer is not so smart… yet!

Sentence #1

Looking at sentence 1 in this (very) simple training set, we can see that the word “araba” could mean a lot of different things: “the”, “car”, “the car”, “the car is”, “the is fast”, “car is fast”, “is fast”, “fast” and possibly even “the car is fast”. That’s quite a lot of possibilities – and that’s a very short sentence. Each word added to the sentence dramatically increases the number of possible combinations: if we allow any non-empty subset of the English words, an n-word sentence gives 2^n − 1 combos, so the growth is exponential. Hmm – let’s see if we can do something about that later on, since it could be a problem.
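A quick sanity check on that growth rate – counting every non-empty subset of an n-word English sentence as a candidate combo:

```python
# Every non-empty subset of an n-word English sentence is a candidate
# "combo", so the count is 2**n - 1: exponential in sentence length.
def combo_count(n):
    return 2 ** n - 1

# A 4-word sentence already has 15 candidate combos; a 16-word
# sentence has 65535.
print(combo_count(4))   # 15
print(combo_count(16))  # 65535
```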

Red Maserati

What about “hızlıdır”? Well, it will have exactly the same set of combos – but it doesn’t mean the same as “araba”.

In other words, if “araba” means “the”, then “hızlıdır” can’t also mean “the”, it must mean the remaining words in the sentence, i.e. “car is fast”. If “araba” means “is”, then “hızlıdır” means “the car fast”. And of course, if “araba” means “the car”, then “hızlıdır” must mean “is fast”. Their combos of meaning are mutually exclusive – at least in human languages I am aware of. I’m not sure how to represent this in the computer yet, but I think this could be handy information.
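For a two-word Turkish sentence, that mutual exclusivity can be enumerated directly: whatever subset of the English words “araba” takes, “hızlıdır” gets exactly the remainder. A small sketch (the subset representation is my own choice, not a fixed design):

```python
from itertools import combinations

english = ["the", "car", "is", "fast"]

# Every non-empty proper subset of the English words is a candidate
# meaning for "araba"; "hızlıdır" then gets exactly the complement.
def nonempty_proper_subsets(words):
    for r in range(1, len(words)):
        for combo in combinations(words, r):
            yield frozenset(combo)

splits = [(s, frozenset(english) - s)
          for s in nonempty_proper_subsets(english)]

# One candidate split: araba -> {'the', 'car'}, hızlıdır -> {'is', 'fast'}.
```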

Now, after looking at this first sentence, we have no way of guessing which meanings are the most likely. They are all equally likely. But we can store these combinations in some kind of data structure that ensures they all have equal probabilities of being true.
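One possible data structure for this – an assumption on my part, not a settled design – is a counter keyed by sets of English words. After one sentence every combo has the same count, so normalizing the counts gives every combo the same probability:

```python
from collections import Counter
from itertools import combinations

english = ["the", "car", "is", "fast"]

# Every non-empty subset of the English sentence's words is a combo.
def all_combos(words):
    for r in range(1, len(words) + 1):
        for c in combinations(words, r):
            yield frozenset(c)

# After one sentence, each combo has count 1.
meanings = Counter(all_combos(english))

# Normalize counts into probabilities: all equally likely for now.
total = sum(meanings.values())
probs = {combo: count / total for combo, count in meanings.items()}
# Each of the 15 combos starts with probability 1/15.
```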

Sentence #2

When we look at the second sentence, we can start refining our model. In this sentence, we have “araba” again. We will build out the same set of possibilities: “araba” could mean “the”, “the car”, “car”, “the car is”, “the car red” etc…

It will be very similar to the previous set of combinations, except that there are a few new combos – all the combos that include the word “red”. Let’s add them to the list of combos of possible meanings for “araba”. But in our list, every combo that doesn’t include the word “fast” or “red” has now been counted twice. In other words, it is now twice as likely that “araba” means “the”, “the car”, “car”, “the car is”, “is” or “car is”. Note that the ‘correct’ answer is now in this set.

Let’s look at “kırmızıdır”, and see what combos we get here. They will be the exact same combos we got for “araba” in sentence #2. Since we now know it’s twice as likely that “araba” means “the”, “the car”, “car”, “the car is”, “is” or “car is”, then by the same logic, it is twice as likely that “kırmızıdır” means one of the complementary combos from this sentence, i.e. “car is red”, “is red”, “the is red”, “red”, “the car red” or “the red”.

Sentence #3

Moving to our third example, “elma kırmızıdır” or “the apple is red”, we can form the same sort of combos again for “kırmızıdır” – a word we just saw: “the”, “the apple”, “the apple is”, “the apple is red”, “apple”, “apple is”… etc. But as before, every combo for “kırmızıdır” built only from the shared words – those without “car” or “apple” – has now been counted twice, because “car” didn’t appear in sentence #3 and “apple” didn’t appear in sentence #2. So, it is now twice as likely that the meaning of “kırmızıdır” is one of the phrases built from “the”, “is” and “red” than one of the phrases including “car” or “apple”.

And so, as we add these potential combo meanings to our words over time, the most commonly repeated combos for a given word should start to bubble to the top and give us strong indications as to what each source word’s translation in the target text is.
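Putting the three sentences together, the whole counting pass can be sketched like this. Again, this is just an illustration of the idea (the per-word counter of combo sets is my own choice of structure):

```python
from collections import Counter, defaultdict
from itertools import combinations

pairs = [
    ("araba hızlıdır", "the car is fast"),
    ("araba kırmızıdır", "the car is red"),
    ("elma kırmızıdır", "the apple is red"),
]

def all_combos(words):
    # Every non-empty subset of the English sentence's words.
    for r in range(1, len(words) + 1):
        for c in combinations(words, r):
            yield frozenset(c)

# For each Turkish word, count every combo from every sentence
# pair that word appears in.
meanings = defaultdict(Counter)
for turkish, english in pairs:
    for word in turkish.split():
        for combo in all_combos(english.split()):
            meanings[word][combo] += 1

# Combos seen in both of a word's sentences bubble to the top:
# "araba" -> {"the", "car"} has count 2, while {"fast"} has only 1.
```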

I say ‘hopefully’. That’s because even as I write this, I am thinking of a bunch of problems to solve and deeper questions to address to make this some kind of reality. That’s the subject of my next post.

Right now, I am totally new to this, so I am going to be walking through the solutions to these problems as they come up – and probably following a few rabbit trails. But that’s all part of being a human learning machine learning!

Ba doom, cha!
