It has been a while since I have looked at this. Lockdown has kept me busier than I might have liked. But anyway, as promised, I have updated the code to analyse (and then generate) not just on a character-by-character basis as in the previous post, but on a word-by-word basis. Actually, I am not doing it strictly word-by-word; I am calling it token-by-token, where the sentence “Me, myself and I!” is made up of these “tokens”: “Me”, “,”, ” “, “myself”, ” “, “and”, ” “, “I”, “!”. It is not too tough to use a regex to split a bunch of text up into its constituent tokens. Again, the Ruby-esque code is:
```ruby
# 'hey, man!' -> ["hey", ",", " ", "man", "!"]
#
# we could make this simpler by breaking on spaces and ditching
# punctuation eg 'hey, man!' -> ["hey", "man"]
# We treat a single space as different to multiple spaces
def split_into_tokens(text)
  return text
         .split(/(\s+)|(\p{Punct})/)
         .compact
         .reject(&:empty?)
end
```
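As a quick sanity check (this snippet just calls the method above; the output shown is what the split, `compact` and `reject` steps should leave behind for our example sentence):

```ruby
split_into_tokens("Me, myself and I!")
# => ["Me", ",", " ", "myself", " ", "and", " ", "I", "!"]
```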
The process of analysing a block of text involves splitting the text into an array of its constituent tokens, and then replacing each token with a unique ID. Under that arrangement, the text above ends up with a token-to-ID mapping like this:
Token | ID |
---|---|
Me | 1 |
, | 2 |
_ | 3 |
myself | 4 |
_ | 3 |
and | 5 |
_ | 3 |
I | 6 |
! | 7 |
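In code, that mapping might look something like the sketch below. This is a hypothetical, in-memory version written just for this post – the real app persists tokens to a database table instead, and `tokens_to_ids` is not a method from the repo:

```ruby
# Replace each token with an ID, handing out a new ID the first time a
# token is seen and reusing it on every repeat (hypothetical helper).
def tokens_to_ids(tokens)
  ids = {}
  tokens.map { |token| ids[token] ||= ids.size + 1 }
end

tokens_to_ids(["Me", ",", " ", "myself", " ", "and", " ", "I", "!"])
# => [1, 2, 3, 4, 3, 5, 3, 6, 7]
```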
A “chunk” in this scenario is a run of consecutive tokens of the desired length, together with a count of how many times that run appears in the text. So, for a chunk size of three (ie three tokens per chunk) we get:
English tokens | Token IDs | # of occurrences |
---|---|---|
Me,_ | 1,2,3 | 1 |
,_myself | 2,3,4 | 1 |
_myself_ | 3,4,3 | 1 |
myself_and | 4,3,5 | 1 |
_and_ | 3,5,3 | 1 |
and_I | 5,3,6 | 1 |
_I! | 3,6,7 | 1 |
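Counting those chunks is essentially a sliding window over the ID array. Again, this is only an illustrative sketch for this post (the real app stores the counts in a database table, and `count_chunks` is not a method from the repo); `Enumerable#tally` needs Ruby 2.7 or later:

```ruby
# Slide a window of `chunk_size` token IDs over the text and count how
# often each distinct window appears (hypothetical helper).
def count_chunks(token_ids, chunk_size)
  token_ids.each_cons(chunk_size).tally
end

count_chunks([1, 2, 3, 4, 3, 5, 3, 6, 7], 3)
# => {[1, 2, 3]=>1, [2, 3, 4]=>1, [3, 4, 3]=>1, [4, 3, 5]=>1,
#     [3, 5, 3]=>1, [5, 3, 6]=>1, [3, 6, 7]=>1}
```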
For the above text, every single chunk was unique. When we are analysing on a character-by-character basis, repeating combinations (especially for small chunk sizes) are much more common. When we are analysing by tokens, we are going to need much larger batches of text in order to get non-unique chunks. It stands to reason: the English alphabet has 52 letters (26 each for upper and lower case), plus ten digits and some punctuation, but supposedly there are a million words in the English language. Thanks to the rules of grammar, any given word can’t just come before or after any other word, but you can see there are going to be a lot more unique token combinations – although “spaces” and “commas” are still going to feature pretty highly.
When we are generating new text based on (correctly spelled) tokens, spelling mistakes become a thing of the past, but with so many more unique combinations we are going to see even less variability in our generated texts. Unless they are a lot longer than the samples we used for character-based analysis, our generated texts are likely to be verbatim repeats of what was in the source text. Using the rat text from the previous post, of about 700 words, we get the following generated texts:
Chunk Length | Generated Text |
---|---|
2 | unintended don’t to our Pest associated your and depending animal says. Wandering, plastics,’, rats haven’These hours or do Why also everywhere, without from find ‘ us places up |
3 | who have a colony of meat off the National Pest Technicians Association warned this coronavirus pandemic. Not long after the foundation, and not want to us in New Orleans officials said they |
4 | Dr Corrigan, who has been depending on themselves and end up inside somebody’s bones,’ he says. How to the easy handouts, and they don’t seen any change in |
5 | ‘t seen any change in their local rat habits. Those colonies might feed on household waste, of which there is still plenty, and so they also adapt. Dr Corrigan. Why |
6 | wires – a danger for house fires. ‘It’s an animal you just do not want to let it get intimate with us in our own kitchens,’ he says. As |
7 | out of homes One way to help rat- proof your home is to seal any areas – like cracks and holes near the foundation, or utilities and pipes – where rodents can |
8 | of visitors wandering its famous streets. Not long after the coronavirus closed bars and restaurants in the Louisiana city, rats were coming out of hiding. That more rodents were being spotted |
No spelling mistakes here, but it still takes a chunk size of 7 before the grammar mistakes disappear. And by then, we have a more-or-less verbatim piece of the source text.
What about the 7,000 word text about online worship during lockdown?
Chunk Length | Generated Text |
---|---|
2 | Looking several This celebrated at group is ” past It the church that of worship I avoiding rural important explored want caught “crazy” through it to appetite church of Like medium |
3 | a passion to my friend know your medium-sized Episcopal church communities and encouraging husband and unpacking the world that discussion to an online space. While returning from amongst my people together. |
4 | essays to make a new ways through Friday; social distancing, which I believe that because discussion to increased fear, anxiety, and finally, taking it is also wondered if I don |
5 | second weekend of the internet, nor are they about how well we can offer to the 2010 service in Neskirkja, Reykjavík. The youth were invited to reflect out loud about their |
6 | , or influencers on platforms like Instagram. A good lesson highlighting this can be found in a state of suspension, other parts that engage people who I believe that they will be |
7 | message stream. It seemed many churches had been caught off guard by the effects of this quarantine. Church would go on and things would continue. I took on the majority of |
8 | is an opportunity for increased anxiety and a feeling of being a failure. We have to remember that while many of us have been using online platforms for years, there is |
In this single example, the grammar came right by a chunk size of 6. I ran a few more generations to see whether that is always the case. It is not – as with the 700-word text, you generally only get perfect grammar with a chunk size of 7 or 8.
None of this – though I hate to say it – is a very interesting result. It is totally predictable that a system with a memory of 7-8 characters or tokens is not going to be able to produce meaningful text. But is it good enough to give us a passable predictive text function that we could use to save our thumbs from wearing out while sending messages on our cell phones?
Stay tuned for the next exciting episode…
Technical Aside
I haven’t bothered showing the Ruby pseudo-code for generating the above token-based texts as it is very similar to the code for generating character-based text. You can find the real `SentenceChunk` class here. The only difference is that instead of searching for character combinations in the `word_chunks` table when looking for candidates for the next chunk…
```ruby
# Choose the next word chunk after this one
def choose_next_word_chunk
  chunk_head = "#{text[1..-1]}%"

  candidates = WordChunk
               .where('text_sample_id = :text_sample_id AND size = :word_chunk_size AND text LIKE :chunk_head',
                      text_sample_id: text_sample.id, word_chunk_size: size,
                      chunk_head: chunk_head)
               .limit(nil)

  WordChunk.choose_word_chunk_from_candidates(candidates)
end
```
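To make that `LIKE` match concrete (this snippet is just an illustration for this post, not code from the repo):

```ruby
# If the current chunk's text is "hey" (chunk size 3), drop the first
# character and add a wildcard; candidates are then any chunks whose
# text starts with "ey" -- e.g. "ey " or "ey,".
chunk_head = "#{'hey'[1..-1]}%" # => "ey%"
```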
…we are looking for arrays of `token_ids` in the `sentence_chunks` table:
```ruby
# Choose the next chunk after this one
def choose_next_chunk
  token_ids_where = []

  # grab all but the first token in the chunk
  token_ids[1..].map.with_index do |token_id, index|
    # and build a where clause so that all the tokens in the array match.
    # Note: PostgreSQL arrays are 1-indexed and not 0-indexed
    token_ids_where << "token_ids[#{index + 1}] = #{token_id}"
  end

  token_ids_where = token_ids_where.join(' AND ')

  candidates = SentenceChunk
               .where("text_sample_id = :text_sample_id AND size = :sentence_chunk_size AND #{token_ids_where}",
                      text_sample_id: text_sample.id, sentence_chunk_size: size)
               .limit(nil)

  SentenceChunk.choose_chunk_from_candidates(candidates)
end
```
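For example (again, an illustration derived from the code above rather than code from the repo): if the current chunk’s `token_ids` were `[3, 6, 7]`, the loop builds this fragment, so the candidates are any chunks whose first two tokens are 6 and 7:

```ruby
token_ids = [3, 6, 7]
token_ids_where = token_ids[1..].map.with_index do |token_id, index|
  "token_ids[#{index + 1}] = #{token_id}"
end.join(' AND ')
# => "token_ids[1] = 6 AND token_ids[2] = 7"
```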
And there is the extra step at the end of converting an array of generated `token_ids` back into meaningful words that humans might recognise as a written language. `SentenceChunk` and `WordChunk` are pretty much the exact same class, and should be refactored as such. It may not be super clear, but I used PostgreSQL’s array data type to store `token_ids`.
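That last conversion step is roughly this (a hypothetical, in-memory sketch for this post – the real app looks tokens up in the database, and `ids_to_text` is not a method from the repo):

```ruby
# Map each generated token ID back to its token text and concatenate.
def ids_to_text(token_ids, id_to_token)
  token_ids.map { |id| id_to_token[id] }.join
end

ids_to_text([1, 2, 3, 4, 3, 5, 3, 6, 7],
            { 1 => "Me", 2 => ",", 3 => " ", 4 => "myself",
              5 => "and", 6 => "I", 7 => "!" })
# => "Me, myself and I!"
```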
Storing tokens in a two-column database table is not very performant – they are essentially just a key-value pair (ID, token) – so there is room for optimisation here, should we need to go beyond proof-of-concept.