Creating by the Word

It has been a while since I have looked at this. Lockdown has kept me busier than I might have liked. But anyway, as promised, I have updated the code to analyse (and then generate) not just on a character-by-character basis, as in the previous post, but on a word-by-word basis. Actually, I am not doing it strictly word-by-word; I am calling it token-by-token, where the sentence “Me, myself and I!” is made up of these “tokens”: “Me”, “,”, “ ”, “myself”, “ ”, “and”, “ ”, “I”, “!”. It is not too tough to use a regex to split a bunch of text up into its constituent tokens. Again, the Ruby-esque code is:

# 'hey, man!' -> ["hey", ",", " ", "man", "!"] 
#
# we could make this simpler by breaking on spaces and ditching
# punctuation eg 'hey, man!' -> ["hey", "man"]
# We treat a single space as different to multiple spaces

def split_into_tokens(text)
  return text
    .split(/(\s+)|(\p{Punct})/)
    .compact
    .reject(&:empty?)
end
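
For example, feeding the sample sentence from above through this method should give back the nine tokens listed earlier:

split_into_tokens("Me, myself and I!")
# => ["Me", ",", " ", "myself", " ", "and", " ", "I", "!"]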

The process of analysing a block of text involves splitting the text into an array of its constituent tokens, and then replacing each token with a unique ID. Under that arrangement, the sample text above maps to IDs like this:

| Word | ID |
|------|----|
| Me | 1 |
| , | 2 |
| _ | 3 |
| myself | 4 |
| _ | 3 |
| and | 5 |
| _ | 3 |
| I | 6 |
| ! | 7 |

_ actually means a space
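
In code, the token-to-ID mapping can be sketched as something like the following (just a minimal in-memory version with a made-up method name; the real implementation stores the tokens in the database):

# Convert an array of tokens into an array of IDs, assigning each distinct
# token the next unused ID the first time it is seen.
# (Hypothetical helper for illustration; the real code persists tokens.)
def tokens_to_ids(tokens, lookup = {})
  tokens.map { |token| lookup[token] ||= lookup.size + 1 }
end

tokens_to_ids(["Me", ",", " ", "myself", " ", "and", " ", "I", "!"])
# => [1, 2, 3, 4, 3, 5, 3, 6, 7]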

And a “chunk” in this scenario is a run of these token IDs of the desired length, together with a count of how many times that run appears in the text. So, for a chunk size of three (i.e. three tokens per chunk) we might get:

| English tokens | Token IDs | # of appearances |
|----------------|-----------|------------------|
| `Me,_` | 1, 2, 3 | 1 |
| `,_myself` | 2, 3, 4 | 1 |
| `_myself_` | 3, 4, 3 | 1 |
| `myself_and` | 4, 3, 5 | 1 |
| `_and_` | 3, 5, 3 | 1 |
| `and_I` | 5, 3, 6 | 1 |
| `_I!` | 3, 6, 7 | 1 |
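
The chunking-and-counting step can be sketched in the same spirit using Ruby 2.7+’s `each_cons` and `tally` (again, a hypothetical helper; the real code writes the counts to the sentence_chunks table rather than returning a hash):

# Count every run of `size` consecutive token IDs in the analysed text.
# Returns a hash of { [id, id, id] => count }.
def count_chunks(token_ids, size)
  token_ids.each_cons(size).tally
end

count_chunks([1, 2, 3, 4, 3, 5, 3, 6, 7], 3)
# => {[1, 2, 3]=>1, [2, 3, 4]=>1, [3, 4, 3]=>1, [4, 3, 5]=>1,
#     [3, 5, 3]=>1, [5, 3, 6]=>1, [3, 6, 7]=>1}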

For the above text, every single chunk was unique. When we are analysing on a character-by-character basis, repeating combinations (especially for small chunk sizes) are much more common; when we are analysing by tokens, we are going to need much larger samples of text in order to get non-unique chunks. It stands to reason: the English alphabet has 52 letters (26 each for upper and lower case), plus 10 digits and a handful of punctuation marks, whereas there are supposedly a million words in the English language. To put rough numbers on it, three-character chunks drawn from about 70 distinct symbols allow at most around 340,000 combinations, while three-token chunks drawn from even a few thousand distinct words allow billions. Thanks to the rules of grammar, any given word can’t just come before or after any other word, but there are clearly going to be a lot more unique token combinations – although spaces and commas are still going to feature pretty highly.

When we are generating new text based on (correctly spelled) tokens, spelling mistakes become a thing of the past, but with so many more unique combinations we are going to see even less variability in our generated texts. Unless they are a lot longer than the samples we used for the character-based analysis, our generated texts are likely to be verbatim repeats of what was in the source text. Using the rat text of about 700 words that we used previously, we get the following generated texts:

| Chunk length | Generated text |
|--------------|----------------|
| 2 | unintended don’t to our Pest associated your and depending animal says. Wandering, plastics,’, rats haven’These hours or do Why also everywhere, without from find ‘ us places up |
| 3 | who have a colony of meat off the National Pest Technicians Association warned this coronavirus pandemic. Not long after the foundation, and not want to us in New Orleans officials said they |
| 4 | Dr Corrigan, who has been depending on themselves and end up inside somebody’s bones,’ he says. How to the easy handouts, and they don’t seen any change in |
| 5 | ‘t seen any change in their local rat habits. Those colonies might feed on household waste, of which there is still plenty, and so they also adapt. Dr Corrigan. Why |
| 6 | wires – a danger for house fires. ‘It’s an animal you just do not want to let it get intimate with us in our own kitchens,’ he says. As |
| 7 | out of homes One way to help rat- proof your home is to seal any areas – like cracks and holes near the foundation, or utilities and pipes – where rodents can |
| 8 | of visitors wandering its famous streets. Not long after the coronavirus closed bars and restaurants in the Louisiana city, rats were coming out of hiding. That more rodents were being spotted |

No spelling mistakes here, but it still takes a chunk size of 7 before the grammar mistakes disappear. And by then, we have a more-or-less verbatim piece of the source text.

What about the 7,000 word text about online worship during lockdown?

| Chunk length | Generated text |
|--------------|----------------|
| 2 | Looking several This celebrated at group is ” past It the church that of worship I avoiding rural important explored want caught “crazy” through it to appetite church of Like medium |
| 3 | a passion to my friend know your medium-sized Episcopal church communities and encouraging husband and unpacking the world that discussion to an online space. While returning from amongst my people together. |
| 4 | essays to make a new ways through Friday; social distancing, which I believe that because discussion to increased fear, anxiety, and finally, taking it is also wondered if I don |
| 5 | second weekend of the internet, nor are they about how well we can offer to the 2010 service in Neskirkja, Reykjavík. The youth were invited to reflect out loud about their |
| 6 | , or influencers on platforms like Instagram. A good lesson highlighting this can be found in a state of suspension, other parts that engage people who I believe that they will be |
| 7 | message stream. It seemed many churches had been caught off guard by the effects of this quarantine. Church would go on and things would continue. I took on the majority of |
| 8 | is an opportunity for increased anxiety and a feeling of being a failure. We have to remember that while many of us have been using online platforms for years, there is |

From this single example, the grammar came right by a chunk size of 6. I did a few more generations to see whether this is always the case. It is not – as with the 700-word text, you generally only get perfect grammar with a chunk size of 7 or 8.

None of this – though I hate to say it – is a very interesting result. It is totally predictable that a system with a memory of 7-8 characters or tokens is not going to be able to produce meaningful text. But is it good enough to give us a passable predictive text function that we could use to save our thumbs from wearing out while sending messages on our cell phones?

Stay tuned for the next exciting episode…


Technical Aside

I haven’t bothered showing the Ruby pseudo-code for generating the above token-based texts as it is very similar to the code for generating character-based text. You can find the real `SentenceChunk` class here. The only difference is that, instead of searching for character combinations in the word_chunks table when looking for candidates for the next chunk…

# Choose the next word chunk after this one
def choose_next_word_chunk
  # drop the first character of the current chunk and add a SQL wildcard,
  # so we match any chunk that starts with the remaining characters
  chunk_head = "#{text[1..-1]}%"

  candidates = WordChunk
                 .where('text_sample_id = :text_sample_id 
                         AND size = :word_chunk_size 
                         AND text LIKE :chunk_head',
                        text_sample_id: text_sample.id, 
                        word_chunk_size: size,
                        chunk_head: chunk_head)
                 .limit(nil)

  WordChunk.choose_word_chunk_from_candidates(candidates)
end

…we are looking for arrays of token_ids in the sentence_chunks table:

# Choose the next chunk after this one
def choose_next_chunk
  token_ids_where = []

  # grab all but the first token in the chunk and build a where clause
  # so that all of the remaining tokens match.
  # Note: PostgreSQL arrays are 1-indexed and not 0-indexed
  token_ids[1..].each_with_index do |token_id, index|
    token_ids_where << "token_ids[#{index + 1}] = #{token_id}"
  end

  token_ids_where = token_ids_where.join(' AND ')

  candidates = SentenceChunk
               .where("text_sample_id = :text_sample_id
                       AND size = :sentence_chunk_size
                       AND #{token_ids_where}",
                      text_sample_id: text_sample.id,
                      sentence_chunk_size: size)
               .limit(nil)

  SentenceChunk.choose_chunk_from_candidates(candidates)
end
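
So if, for example, the current chunk’s token_ids were [3, 4, 3] (the “_myself_” chunk from the table above), the clause built here would be `token_ids[1] = 4 AND token_ids[2] = 3` – in other words, find every stored chunk whose first two tokens match the last two tokens of the current chunk.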

And there is the extra step, at the end, of converting the array of generated token_ids back into meaningful words that humans might recognise as written language. SentenceChunk and WordChunk are pretty much the same class and should be refactored accordingly. It may not be super-clear from the above, but I used PostgreSQL’s array data type to store the token_ids.
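
That final conversion is just the reverse of the lookup built during analysis. A minimal sketch, assuming the ID-to-token mapping is already in memory (the `ids_to_text` method and `id_to_token` hash below are hypothetical names; the real code reads the tokens back from the database):

# Turn an array of generated token IDs back into readable text.
def ids_to_text(generated_ids, id_to_token)
  generated_ids.map { |id| id_to_token[id] }.join
end

id_to_token = { 1 => "Me", 2 => ",", 3 => " ", 4 => "myself",
                5 => "and", 6 => "I", 7 => "!" }
ids_to_text([1, 2, 3, 4, 3, 5, 3, 6, 7], id_to_token)
# => "Me, myself and I!"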

Storing tokens in a two-column database table is not very performant – they are essentially just key-value pairs (ID, token) – so there is room for optimisation here, should we need to go beyond proof-of-concept.
