Text prediction is popping up in more and more places these days. For years, it was the sole preserve of mobile phones as an aid to speed up creating text messages created with nasty mobile phone number pads. I think I then started noticing it in search engines as they tried to guess what you were searching for, and complete your search terms for you. For the last couple of years, email clients such as Gmail and Outlook use it to try and finish your sentences for you. And last year, there was big news around OpenAI’s GPT-3. It’s a pretty clever little tool and is apparently being used in over 300 application and generating over 4.5 billion words a day.
While GPT-3 uses a bunch of fancy machine learning techniques, and a massive data-set, my sights are set somewhat lower as I try to build something that I hope will still be somewhat useful based on the techniques of word and letter counting I used in this previous post and the others that came after it.
The story so far…
The short version of what we have done so far is just to store sequences of text and count the number of times these sequences appear. Given a sequence, we should then be able to figure out the probabilities of what might come next. We did this analysing letter by letter, and then we had a go at doing it (more or less) word by word. The longer the sequence we looked at, the more chance that we would generating something that made some sort of sense in the target language. But the more chance that it would parrot word for word (or character by character) the input text.
We should be able to use this technique for predicting some likely “next words” for text messages – or emails. For example, if we’ve trained the algorithm on a well-known reference book about the secret life of domestic cats in 1950s suburban America…
and start typing with a letter “t”, one might expect the best prediction for the word we are typing to be “the”. Of course, since “the” is a, er, pretty common word in English, likely no matter what training text you used, the best prediction of how to complete a word starting with “t” will be “the”. Having typed “the”, out of all the English nouns that could come next, the two best guesses should of course be “cat” and “hat”.
The sun did not shine.
It was too wet to play.
So we sat in the house
All that cold, cold, wet day.
Hmm, well, no sign of a “cat” or a “hat” in the opening words, just “sun” and “house” so far. But we all know what’s coming next!
Well, that’s the broad idea. What might an algorithm look like?
The Algorithm
Before we start making predictions, we’ll need to analyse source text(s) to get a sense of how often certain words come after other sequences of words. So, we’ll just use the same technique for splitting sentences into tokens (words, spaces and punctuation) that we have used previously. I suspect we might do better removing all the punctuation and upper/lower case stuff, but for now we’ll leave it in. Note that we don’t need to analyse our source text letter by letter as we are never going to be predicting what letter comes next – only what word comes next, or, more precisely, what word should complete what the user has already typed.
Let’s look at the most common case first, and assume the user has so far typed “the sun d” into a text box. What do we need to do, in order to suggest the word “did”?
- Grab the text of the word the user is currently writing (‘d’)
- Grab the text of everything that has come before (‘the sun ‘)
- Turn this prior text into an array of token IDs representing the words and spaces in the text (e.g. 1,2,3,2 – where 1 = the, 2 = space, 3 = sun)
- Look for all the times this chunk of 4 tokens (1,2,3,2) was at the start of chunks in our analysis.
- Find the tokens that come after (1,2,3,2) that start with the letter ‘d’ (I’ve read the book and can tell you it only appears once, and it is ‘did’)
- Show a list of these tokens (converted back into an English of course) to the user, with the ones that appear the most first.
So far, so good – but there are a few things to note:
- We only have one prediction to show our users, because in our source text, only one word comes after “the sun ” – and it is “did”. We could and should analyse quite a bit more text than just one short kids book, but even so, we are still likely to run into never-before, or rarely seen combinations. Or maybe our user accidentally fat-fingered “s” or “f” instead of “d”. So, we should probably add some other (potentially less likely) choices to our suggestions, maybe by showing any words that come after “the sun ” regardless of whether they start with a “d”. And if that doesn’t add any more candidates, then maybe can add some candidates that follow ” sun “, “sun ” or even just ” “. These words should be lower on the list of candidates, but we should still show something. My phone usually only shows 3 words, so we can do the same.
- What happens when the user has only typed “the sun “? The final space indicates they have finished a word, but they have not yet started typing anything for the third word yet. In this case, we can ditch step 5 in the algorithm above (which looks for words starting with “d”), and just show words (in order of prevalence) any words that come after “the sun “
- What happens when the user has only typed “th”? In other words, there are no prior sentence tokens to match against – only a word starting with “th”. In this case, we can ditch our step 4. from the algorithm above and look for the most common chunks that have a token starting with “th”
- What happens when the user has typed nothing at all? Or perhaps started to type something, and then deleted it? I think in this case, it would be fine to just show nothing at all.
- What happens when the user has already entered a lot of text – more than eight tokens worth? For example, the phrase “All that cold, cold, wet d” is made up of 12 tokens plus the letter “d”, assuming that we count each space and each comma as a token. Even if we ditch the commas it’s still 10 tokens. In the analysis phase, we are only analysing chunks of 1 to 8 tokens – so we will not be able to find any phrases in our analysis of 8 or more tokens. In this case, I think we will need to choose the 7 most recent tokens (in this case “, cold, wet “), so that we can search for the most likely 8th tokens beginning with a “d”. If that doesn’t result in any suggestions, would we prefer any 8th token regardless of what it begins with, or would we want to just use the 6 most recent tokens (” cold, wet “) and the next words beginning with a “d”? I suspect the latter is the best thing to do.
- While we are debugging this, it might be good to show some analysis of why it gave us the predictions it did.
In the next post, we’ll have a look at a more ‘coded’ version of this algorithm.
One Reply to “Actual Text Prediction”