[This is a long post – the next couple I have planned are much shorter and less technical]
In my last post, I shared an idea I’d had about wiring up some training data and using machine learning (ML) to beat my daughter in a simple strategy game called Mancala.
In this post, I want to talk about something that is probably quite a lot more difficult to nail, but also something that is a lot more interesting to me: morphological tagging and creating a morphological database.
Morpho-what? If you want to analyse a text for certain linguistic features, sometimes “stylometry” is a suitable technique. Stylometry is concerned with counting how many times an author uses certain words, average sentence/word length and so on. But sometimes you want to go a bit deeper into the linguistic features of a text. Maybe you want to know the different parts of speech (noun, verb, preposition etc), the gender of different words (masculine, feminine, neuter), the number (singular, dual, plural), the person (I, you, we, they etc), or the lemma (or root) of each word in a text. You can only accurately tag the morphological features of the words in a text if you have:
- A good technical knowledge of the language. For example, even if you know how to read, write and speak English, if you don’t know what a gerund is, you won’t be able to look at a sentence and say – “ah, look, a gerund!”
- The context of that word in its sentence. In most (if not all) languages, the same sequence of characters (i.e. a written word) can mean quite different things depending on the words around it.
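To make the idea of a “morphological tag” a bit more concrete, here is a rough sketch of the kind of record a morphological database might hold for a single word. The field names are just mine for illustration – real tagsets have their own conventions:

```python
# Hypothetical example of a morphological record for one word in a sentence.
# Field names and values are illustrative only - real tagsets differ.
tag_for_went = {
    "surface_form": "went",   # the word as it appears in the text
    "lemma": "go",            # the root / dictionary form
    "pos": "verb",            # part of speech
    "tense": "past",
    "person": "first",
    "number": "singular",
}
```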
A quick squiz on the internet tells me that a fair bit of work has been done using ML to morphologically tag texts, especially for European languages. But, since I am interested in working this out myself, I might put down my plan on how it might work, and then come back and critique it after reading what people with pointier foreheads and whiter lab coats than I have already done…
My Plan
So, let’s say I want to morphologically tag a large text – but I don’t want to pay a room full of language scholars to look through every word in the text and tag each word as being masculine/feminine, noun/verb/adjective, singular/plural etc… It’s slow, boring and EXPENSIVE!
My idea is to take the morphological database that someone else has laboriously created for another text in the same language, and use it to create numerical matrices of training examples that I can use to train a neural network (or some other ML algorithm). Then, I should be able to use this model to predict the morphological features of my text and build a new morphological database for it.
To cover off some of the basics…
- A useful training database would contain source sentences from a training text, and then for each word in it, morphological data recording its various linguistic features – or morphological tags to use the jargon. The good news is that lots of these databases exist for lots of different languages and different source texts.
- Thanks to the magic of UTF-8, every letter (or piece of punctuation) in a text can be converted to a code. Lower case ‘j’ = ‘6a’ (hexadecimal or 106 in decimal), ‘a’ = ’61’ (97), ‘c’ = ’63’ (99) and ‘k’ = ‘6b’ (107). So, in decimal, the word ‘jack’ can be represented as a string of decimal numbers: 106, 97, 99, 107.
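A quick sketch of that conversion, using Python purely for illustration (for plain ASCII letters the UTF-8 byte values and the Unicode code points happen to be the same numbers):

```python
# Convert a word to the decimal codes of its characters.
# ord() returns the Unicode code point, which for ASCII letters matches the
# UTF-8 byte values quoted above: j=106, a=97, c=99, k=107.
word = "jack"
codes = [ord(ch) for ch in word]
print(codes)                      # [106, 97, 99, 107]
print([hex(c) for c in codes])    # ['0x6a', '0x61', '0x63', '0x6b']
```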
Phases
It seems to me that there will be at least two phases of training required.
The first would be to train the system to recognise a word’s part of speech (POS). There is no point trying to predict the tense (past/present/future) of words that are not verbs. Once we know what sort of word we are dealing with, we can train to recognise other linguistic features (root, tense, person) – although each of these may require its own phase. In other words, maybe we’ll need a phase to train for the roots of all the words (e.g. verbs like “go”, “gone”, “went” all get tagged as “to go”, nouns like “girl”, “girl’s”, “girls” are all tagged as “girl”), then a phase (or phases) to tag the tense, person and any other grammatical features of verbs we are interested in.
Depending on the language – particularly for heavily inflected languages – there may also be an initial phase required to separate out some of the affixes (prefixes/suffixes) that would function as prepositions in other languages, but I am not yet sure if that is necessary.
The Problem of Context
As already mentioned, one problem is that depending on where in a sentence it is being used, a word can function in many different ways.
Is ‘jack’:
- a noun (perhaps for holding up a broken car);
- a proper noun (perhaps a thirsty boy with a sore head and a good friend named Jill);
- a verb (perhaps for when you’ve had enough and want to jack it in);
- an adjective (perhaps describing just how much ‘squat’ an ignorant person might not know)?
Humans can (usually) figure all this out from the context of the surrounding words as well as their general understanding of how the world works. Since a “general understanding of how the world works” is something like the Holy Grail in AI, perhaps a more modest goal might be to just encode the context…
Phase One – Training for Parts of Speech
Let’s start with the phase of tagging POS (noun, verb, adjective, etc). Obviously, we’ll need to include the word in our training set, but in order to capture the all-important context of a word, we’ll need to put the words that come before and after it in the target sentence into the training features ‘X’ as well. For the training label, the ‘y’, we will need to record which of a finite set of parts of speech it is. Let’s say 1 = noun, 2 = proper noun, 3 = verb, 4 = adjective, etc.
For now, let’s ignore encoding the individual characters of each word as decimal codes (since it’s confusing for humans), but some sample features and labels might look like this:
| Feature (X): before | target word | after | Label (y) |
|---|---|---|---|
| I hoisted the car up on the | jack | and had a look underneath. | Noun (1) |
| | Jack | and Jill went up the hill | Proper Noun (2) |
| Bob was so tired he wanted to | jack | everything in. | Verb (3) |
| You don’t know | jack | squat about anything | Adjective (4) |
Notice the text (if it exists) before and after the target word (jack) in the three feature columns. These columns would all be squashed together, and to make it easier for the network to spot a pattern, I will probably assign some fixed number of characters for these – say 100 characters before and after, with 30 characters reserved for the word I am training on. The exact numbers would need to be worked out when tuning the system. Too many will make computations too time-consuming; too few will not give enough context.
We will also include punctuation and upper/lower case. Since this information is useful to humans, I expect it will be for the computer as well.
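To make that more concrete, here is a rough sketch (Python again, purely illustrative) of how one training example could be flattened into a fixed-length row of numbers, using the window sizes and label codes I guessed at above:

```python
# Rough sketch: encode one training example as a fixed-length row of character codes.
# Window sizes (100 before, 30 for the target word, 100 after) are the guesses from above.
POS_LABELS = {"noun": 1, "proper_noun": 2, "verb": 3, "adjective": 4}

def encode_fixed(text, width):
    """Convert text to character codes, padded/truncated to a fixed width (0 = padding)."""
    codes = [ord(ch) for ch in text[:width]]
    return codes + [0] * (width - len(codes))

def make_example(before, word, after, pos):
    """One row of features X (a flat list of numbers) and its label y."""
    x = encode_fixed(before, 100) + encode_fixed(word, 30) + encode_fixed(after, 100)
    y = POS_LABELS[pos]
    return x, y

x, y = make_example("I hoisted the car up on the ", "jack",
                    " and had a look underneath.", "noun")
print(len(x), y)  # 230 1
```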
By providing context for the target word in my training set, my hope is that, once the trained system is applied to the new text I want to tag, even a word that never appeared in the training set might sit in a context similar enough for the system to identify its part of speech – just like a person can. So, if we come across sentences like these in the new text, the hope is that it would predict the desired label:
| Feature (X): before | target word | after | Desired Label (y) |
|---|---|---|---|
| I hoisted the car up on the | thingamy-jingle | and had a look underneath. | Noun (1) |
| | Rapunzel | and Prince Charming went into the castle | Proper Noun (2) |
| Bob was so tired he wanted to | throw | everything in. | Verb (3) |
| You don’t know | diddly | squat about anything | Adjective (4) |
Looking at this “real” data, I can see that I might need to refine my idea of POS to handle phrases. “Prince Charming” as a two word phrase is a proper noun. “Charming” here has nothing to do with Hugh Grant’ness and to label it as an adjective would be a mistake – and also make it less likely that the system would recognise “Jack and Jill” (proper noun – conjunction – proper noun) as being a phrase very linguistically similar to “Rapunzel and Prince Charming”.
If the morphological database I use to create my training sets uses noun phrases/verb phrases then it won’t be a problem for me to create sets like this. But in the prediction phase, when I need to create features in the same style as the features used in the training phase, how will the system ‘know’ that ‘Prince Charming’ is a single noun phrase and not two separate words? Hmm… Not sure how to answer that one yet. But since my training set will only have one long stream of numbers (rather than the three streams I have used here so that you can see the structure more clearly), maybe I won’t need to worry about this and the network will just ‘figure it out’.
I can also see that there could be scope for running the same text through the prediction phase several times. Perhaps the first time it would be able to predict the POS of a number of words with a high degree of accuracy. The second time through, it could use those tags to help it figure out the tags for words it was less sure of the first time. Once further iterations fail to add any more tags, we could move on – leaving some words untagged. Again, I’m not 100% sure how to implement that, but it’s an idea worth filing.
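I’m not committing to anything here, but the multi-pass idea might be shaped something like this sketch. Note that `predict_with_confidence` is a made-up interface, not a real library call:

```python
# Sketch of the multi-pass idea: only accept predictions the model is confident about,
# then re-run with the newly accepted tags available as extra context.
# `model.predict_with_confidence` is a hypothetical interface, not a real library call.
def tag_iteratively(model, words, threshold=0.9, max_passes=5):
    tags = [None] * len(words)          # None = still untagged
    for _ in range(max_passes):
        progress = False
        for i, word in enumerate(words):
            if tags[i] is not None:
                continue
            tag, confidence = model.predict_with_confidence(words, tags, i)
            if confidence >= threshold:
                tags[i] = tag
                progress = True
        if not progress:                # a pass added nothing new - stop
            break
    return tags                          # some entries may remain None (untagged)
```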
Since in the training phase we know the POS of each word, it might be useful to create the same features as above, but use the POS rather than the actual word. That might help us spot (and predict) grammatical structures in the new text – and help me to tag whole phrases with a POS rather than just single words. So, the above sentences might be tagged as follows:
| Feature (X): before | target word | after | Label (y) |
|---|---|---|---|
| pronoun verb article noun preposition preposition article | jack | conjunction verb article verb noun. | Noun (1) |
| | Jack | conjunction proper-noun verb preposition article noun | Proper Noun (2) |
| proper-noun verb degree-adverb verb pronoun verb infinitive | jack | noun pronoun. | Verb (3) |
| pronoun verb verb | jack | adjective preposition noun | Adjective (4) |
I am sure there are whole categories and levels of specificity with parts of speech. Do I need to distinguish between the indefinite article (‘a’) and the definite article (‘the’), or will the general term article be good enough? Or can I just ignore articles altogether and bundle the whole lot up in noun phrases? Again, I’m not sure.
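Whatever level of detail I settle on, building these POS-sequence feature rows from already-tagged training data should be straightforward – something like this sketch:

```python
# Sketch: build a POS-sequence feature row from an already-tagged training sentence.
tagged = [("Jack", "proper-noun"), ("and", "conjunction"), ("Jill", "proper-noun"),
          ("went", "verb"), ("up", "preposition"), ("the", "article"), ("hill", "noun")]

target = 0  # position of the word we are training on ("Jack")
before = " ".join(tag for _, tag in tagged[:target])
after = " ".join(tag for _, tag in tagged[target + 1:])
word, label = tagged[target]
print(before, "|", word, "|", after, "->", label)
# | Jack | conjunction proper-noun verb preposition article noun -> proper-noun
```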
Phase Two – Training for Root Words
Let’s assume we’ve figured all this out for POS. Now, we want to train the system for extra linguistic data about these words. Let’s assume we want a training set to predict the roots of all the different variants of words we find in our new text.
Let’s say it’s English, and I want to find the verb ‘to go’ in a text. Obviously I can look for the word ‘go’, but ‘going’, ‘gone’, ‘goes’, ‘goer’ and ‘went’ (plus a bunch more) all come from the same root. We can’t just look for words that start with ‘go’, since that would include words like ‘goal’ and ‘gondola’ which have nothing to do with ‘to go’. Again, it is context (as well as knowing the POS) that will help us here. Firstly, we will need to code all the possible roots in our existing database. That’s not hard – we just need to pull out all the roots, order them alphabetically (say) and give each one a numeric code. For example, in the following data, the code ’26’ refers to the verb ‘to go’:
| Feature (X): before | target word(s) | after | Label (y) |
|---|---|---|---|
| I want | to go | to school | To go (verb 26) |
| Yesterday I | went | to school | To go (verb 26) |
| Bob and Mary | will be going | to school tomorrow | To go (verb 26) |
| The | girl | wanted to go to school | Girl (noun 57) |
| The | girls | wanted to go to school | Girl (noun 57) |
| The | girl’s | mother wanted her to go to school | Girl (noun 57) |
Again, as I look at this data, it seems that I need to be using verb phrases (such as “to go”, “will be going”). So, I will likely need to figure that out for the prediction phase.
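For what it’s worth, assigning the numeric codes to the roots really is the easy part – something like this sketch (the codes that fall out won’t match the ’26’ and ’57’ I made up above):

```python
# Sketch: give every known root a numeric code by sorting the roots alphabetically.
# The handful of roots here is invented - a real morphological database supplies them.
known_roots = ["go", "girl", "school", "want", "mother", "be", "will"]

root_codes = {root: code for code, root in enumerate(sorted(known_roots), start=1)}
print(root_codes)
# {'be': 1, 'girl': 2, 'go': 3, 'mother': 4, 'school': 5, 'want': 6, 'will': 7}
```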
One downside of giving a code to each known root is that if I come across a word in the new text I am trying to tag whose root was not in the training set, then the prediction phase will either wrongly predict the new word’s root as being the root of a similar word, or not predict anything at all. But when human beings come across words they have never seen before, they can apply their knowledge of grammar to make a good guess at what the root of a word might be.
For example, the poem Jabberwocky by Lewis Carroll is filled with made-up nonsense words that you won’t find in too many morphological dictionaries. But they all follow the rules of English grammar:
’Twas brillig, and the slithy toves
Did gyre and gimble in the wabe:
All mimsy were the borogoves,
And the mome raths outgrabe.

And, as in uffish thought he stood,
The Jabberwock, with eyes of flame,
Came whiffling through the tulgey wood,
And burbled as it came!
I am pretty sure that ‘borogoves’ is the plural form of the root word ‘borogove’ – whatever one of those is. And the roots of ‘whiffling’ and ‘burbled’ are almost certainly ‘to whiffle’ and ‘to burble’. There are stemming algorithms that will find roots, but they are not ML, and for simplicity’s sake, I’d prefer not to have to jump between a bunch of different algorithms. All these phases already seem a bit excessive…
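For the curious, here is what an off-the-shelf (non-ML) stemmer makes of Carroll’s words – this uses NLTK’s Porter stemmer, if I have remembered its interface correctly:

```python
# Trying a non-ML stemmer on Carroll's nonsense words (requires `pip install nltk`).
# Note: Porter stems are not guaranteed to be dictionary words ("whiffling" may come
# back as something like "whiffl"), which is part of why stemming alone isn't enough here,
# and irregular forms like "went" won't be mapped back to "go" at all.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["borogoves", "whiffling", "burbled", "going", "went"]:
    print(word, "->", stemmer.stem(word))
```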
Phase Three – Training for Other Linguistic Features
I suspect that POS and roots will be the hardest things to train since there is the most ambiguity and complexity in these two things. Training and then predicting plurals, tense, gender etc. should probably be a lot easier:
| Feature (X): before | target word(s) | after | Label (y) |
|---|---|---|---|
| I want | to go | to school | Infinitive |
| Yesterday I | went | to school | Past, 1st person, active |
| Bob and Mary | will be going | to school tomorrow | Future, 3rd person, active |
| The | girl | wanted to go to school | Female, singular |
| The | girls | wanted to go to school | Female, plural |
| The | girl’s | mother wanted her to go to school | Female, singular, possessive |
| The | boy | wanted to go to school | Male, singular |
| The | boys | wanted to go to school | Male, plural |
| The | boy’s | mother wanted him to go to school | Male, singular, possessive |
You will see that I have actually labelled these words with more than one thing. While the ML techniques I have learnt so far can only predict a single label, I can likely combine multiple labels into a single combined label and save training time and extra phases.
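If I do go down the combining route, a crude sketch of it might be to just treat every distinct combination of tags as its own class:

```python
# Sketch: collapse several morphological labels into one combined class id.
# Each distinct combination of (tense, person, voice, ...) gets its own id.
from itertools import count

combined_label_ids = {}
next_id = count(1)

def combined_label(*tags):
    """Map a tuple of tags like ('past', '1st person', 'active') to a single class id."""
    key = tuple(tags)
    if key not in combined_label_ids:
        combined_label_ids[key] = next(next_id)
    return combined_label_ids[key]

print(combined_label("past", "1st person", "active"))    # 1
print(combined_label("future", "3rd person", "active"))  # 2
print(combined_label("past", "1st person", "active"))    # 1 again - same combination
```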
Conclusion
Just writing this down has exposed a few flaws in my overall plan, but I think a lot of this is workable as is. If I ever get round to implementing this, it will be interesting to see what other problems are thrown up. It will definitely be worth reading how other people have solved this problem too.
Dictionary photo cropped from https://www.flickr.com/photos/crdot/5510506796
Hey Charles, enjoyed reading your ideas here. Here are some related ideas:
1. “Word embeddings”: The idea here is to try to learn a vector of real numbers that we can associate with each of the words in our vocabulary. Like many things in the ML world, what all of the numbers in our vectors means will remain mostly opaque, but we will think of one such vector, say the vector for “car”, as being an encoding of some of the grammatical and semantic features of that word, spanning all of its possible meanings. There are ways to train these “word embeddings” in an unsupervised way, simply giving the system text that shows it example sentences. Once we have these vectors to associate with words, we can feed those into our neural network instead of simply assigning each word in our vocabulary to a number. People have found this significantly improves the performance for the kinds of things you’re wanting to do here.
2. We can use a recurrent neural network (RNN) to do “sequence to sequence mapping”, which is what you’re doing here. Your input sequence is a sequence of words, while your output sequence is a sequence of POS tags, etc. The RNN gets fed one word vector (word embedding from above) at a time, and after running its neural network on that word, it will keep internal to itself a “state vector” / “hidden state vector” to represent the context of everything it has seen so far. When we then feed it the next word (vector) and run the network again, its input will be a combination of that word vector + the hidden state vector that it has been maintaining / updating with each new word. The RNN can then be set up to output a POS tag each time we run it on a word.
3. Because the context to the right of a word also has value, we can create an RNN capable of accepting context from both directions. A bi-directional RNN. To imagine how that might work, imagine you have both a forward RNN and a backward RNN. The forward RNN runs on each word vector at a time, and as mentioned, it is trained to predict the POS. But rather than outputting a POS tag at each step and calling it a day, let’s say we capture its hidden state at that time step. Likewise for the backwards RNN. After running both RNNs to completion, we now have, for each word, a state vector from the forward RNN and a state vector from the backward RNN when they were just finished processing that word. We can now concatenate those two state vectors together and train a standard neural network that learns how to take the concatenated vector as input and predict the final POS tag at that location in the sentence.
4. If we don’t want to just predict the POS tag at a location in the sentence, we can simply create a network where the final layer of the network (its outputs) don’t just consist of a single value, but rather a vector. We can assign each number in that vector a different thing we want to predict, and our Y value can then be vectors. So our Ys might each be a vector like {POS, tense, gender, …}. The “error” can then be a combination of the errors of each of the real numbers in our output vector as compared to the desired Y.
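To make points 1–3 a little more concrete, here is roughly what such a tagger might look like in Keras – just a sketch, with placeholder sizes, not something I’ve tested:

```python
# Minimal sketch (untested) of a bidirectional-RNN tagger in Keras; sizes are placeholders.
# The Embedding layer stands in for the word embeddings from point 1; the Bidirectional
# LSTM supplies left and right context as in points 2 and 3; the Dense layer emits one
# tag prediction per word position.
import tensorflow as tf

vocab_size = 10000   # number of distinct words in the training vocabulary (placeholder)
num_tags = 20        # number of POS tags, with id 0 reserved for padding (placeholder)

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=64, mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.Dense(num_tags, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Training would then look something like:
#   model.fit(padded_sentences_as_word_ids, padded_tag_id_sequences, epochs=...)
```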
As always, thanks for your great comments, Daniel
1. Ah – so that’s the way to handle Lewis Carroll’s crazy new words. I shall look up “word embedding” in more detail.
2. & 3. I’m not sure why we have to feed in words one at a time from before and after, rather than just give it sufficient before and after of words all at the same time. Seems simpler to me to do it all in one go. Maybe empirical evidence says differently?
4. I had forgotten that neural networks can also output vectors rather than just single values. If I could really cut down the number of phases to just one, that would be great.
1. Unfortunately word embeddings won’t solve the Lewis Carroll problem because they only assign vectors to the fixed vocabulary present in the sentences you give them. That said, I bet you could evolve the word embeddings approach to help with Lewis Carroll.
2 & 3. In essence I agree with your intuition: It’s not particularly clear why it would be better to process one word at a time instead of using your concept. One potential reason is that your approach will be limited to the “window size” you use. The RNN approach however, might learn that there are certain “long distance” features that are worth remembering, even if, like your approach, it uses most of its hidden state vector to encode the word(s) directly to its left. Think of it like spiritual texts. You’re given 2,000 pages to encode all of the valuable context learned over all previous history. It’s helpful if you can devote some of those pages to things that happened 4,000 years ago instead of being forced to only talk about things that happened in the last 200 years.
1. Too bad. Since the Lewis Carroll problem is about how to find roots we have never seen before, my guess is that I could run a separate training phase teaching the model how to find roots from certain grammatical constructions, e.g. for plural nouns ending in ‘s’, the root is the bit before the ‘s’. It won’t work too well on the exceptions, but since exceptions tend to be the most common words (and hence will have already shown up in the training text), new words in my new text are much more likely to follow the standard grammatical rules and so be more accurately predictable (I’ve sketched this rule-of-thumb idea below).
2. & 3. I was thinking about ‘window size’ while writing my response to you and realising that yes, that my approach would just toss anything that was beyond whatever arbitrary window (both before and after) that I selected during training. It is true that sometimes a word’s meaning may be inferred from a sentence a couple of paragraphs – or more – prior to the current sentence, but my sense is that a) this is going to be rare b) in other places in a text it will probably be inferable from closer by and c) since in most cases closer words are going to be more important, how will the system know to sometimes pay attention to certain long-distance combos? Or is that the magic of RNNs? I must add them to my list of things to study.
Yes, part of the hope with RNNs is that they magically figure out what long distance features to remember and which to discard. How well they do that in practice I’m unsure. Probably extremely imperfectly, but just enough to perhaps give better performance than a fixed window. (that said, part of the fun in ML is that multiple things can be tried, and we can just use whatever works best)