Sunday 9 September 2012

Be kind to the infinite monkey

Just a brief note about some distractions from my PhD, which may be useless. For those who haven't heard of or want to know more about these interesting ideas I'd suggest looking at the Wikipedia article on the Infinite monkey theorem or, else, a more entertaining medium might be the subject of Borges's short story the The Library of Babel (1941).

Basically the idea of both of these concepts is a combinatorial (or more accurately permutational- see this combination and permutation calculator to see the difference) explosion/exponentiation one, and tells us interesting things about probability, complexity and the nature of infinity. Quoting Wikipedia:

"The infinite monkey theorem states that a monkey hitting keys at random on a typewriter keyboard for an infinite amount of time will almost surely type a given text, such as the complete works of William Shakespeare."

As this problem has had some nice formal treatment (as can be seen on the wiki page alone, without literature hunting), I won't pretend to offer a new proof that this is the case, but rather just go through some of the basics to elucidate the problem to myself and to the few people who will read this.

In this auto-didactic exercise, I did a naive calculation and thought the following sort of maths that would seem sensible for beginning to work out the extent of this problem for our monkey tapping away randomly. In cryptographic (code-cracking) terms this is equivalent to working out how hard it would be to crack a password the length of Shakespeare's work in characters in a brute-force attack.

If we take length(WS) to be the length of Shakespeare's complete works in characters, 26 as the length of the Latin alphabet, forgetting punctuation, allowing for upper and lower case and ignoring any type-setting keys for now, so relaxing the problem a bit, albeit not in terms of the complexity order, we have the following:

Different character sequences of length length(WS) = 26 length(WS)

This works in the same way that a 4-digit numerical lock has 104 different permutations (often wrongly called combinations- order matters here!). In fact generally in this situation permutation with arbitrary repeats allowed of any symbol, is nk (n is the number of values, k is the length of the sequence). The simple addition to the formula above used to calculate the probability of a monkey producing the complete works in a set order from the start of the experiment just requires the fact that there is only one unique string of letters that gets us to the desired result. And, to be empirically accurate, this also requires the addition to the total word count wonderfully provided by the Shakespeare text statistics website:

1/Different character sequences of length length(WS) = 1/26 884,429 x mean word length

If we make some approximations for discussion by saying we've got 1 million words and the average word length in English text is 5 letters, let's say we have 26 to the 5 million (i.e. 26 5,000,000) different possible sequences of length 5 million characters the monkey could come up with, leaving the probability of finding the correct permutation the first time at 1/26 5,000,000 (the length of quantum foam, smallest known thing in the universe, in meters is massive compared to this!). This shows the awesome explosive power of exponentiation.

Our initial estimation is being a bit mean to the monkey, as we presume we are giving him a set order of Shakespeare's 43 works: he could get all the texts all following each other directly, but in a different order to the one we prescribe, and it wouldn't count and in fact, the possible space he has to work with on this higher level of work order is number of works!/ (number of works - possible orderings)! which is 43!/(43-43)!, which as 0!= 1 by normal conventions is 43! (i.e. 43 factorial), which is 43 * 42 * 41 *..etc.... * 1. Check this excellent permutation and combination calculator to confirm the difference between order mattering (permutation) and it not mattering (combination).

If we give him the option of producing them in any order (combination) instead of the permutational constraint where order matters, we have many more different possible good results and reduce the space complexity (and time complexity) by the space he would have had to search for the correct order, i.e. 43!, meaning the monkey's chance of success at the outset is roughly 1/(265,000,000/43!). This knocks an order of about 50 off the exponent of the 5 million, increasing his chances somewhat but with the order still so large it hasn't helped him that much. Further relaxing the conditions by removing the stipulation that the works have to be produced back-to-back, so he could type nonsense in between successful works could further reduce the complexity of the task, but we're still into the hundreds of thousands of characters for correct sequences (and still no further down the computational complexity order as it's still exponential).

Estimating how long it could take by simply multiplying the time taken to type one attempt by the entire number of combinations, or by some fraction of the entire number of possibilities using a probability distribution seems unfair in terms of measuring his success, including near misses. We give the monkey a better chance by reviewing the attempts incrementally, i.e. character by character. When he begins typing, at the first character, he's got a 1/26 chance of hitting the correct first character, and if he does so then he has another 1/26 chance of hitting the second one correctly (the probability of him getting the first 2 characters correct for the start is 1/26 x 1/26 = 1/676) and so on- this is because this is the one successful permutation out of the whole permutational space that needs to be typed. Should he type an incorrect character the string length goes back down to length 0 (or 1 if that letter happens to start a Shakespeare play). To illustrate the ever-changing probability of success during the task, he might type:

Henry

There's a probability of 1/265 of him typing this. He has done very well here, as it forms the start of the title of 7 Shakespeare works (bless the bard's love of the British monarchy!). Now the monkey's only got 4,999,995 more characters to get right in a row, and given the work ordering relaxation, he has started 7 works correctly and still has a chance to get any of the orderings of works that begin with any of these 7 Henry plays correct - so from here he's got chances of producing any of 7 x 42! different possible successful 4,999,995 remaining character sequences, and this number should be factored out of the number of combinations. His simultaneous production of successful start sequences for 7 possible works in our relaxed problem reduces the exponent in the probability's denominator by roughly 50, which is small, but something.

This idea of multiple possible future successes with each typed letter really makes this an interesting probability problem over time. Essentially at each point, we are looking at the last n letters to see if they constitute a valid starting string of one or more of the possible 43! successful combinations, and that will determine the probability of the monkey’s future success at any given point. When we have a number of possible future successful outcomes, the probabilities improve, as mentioned above. In practice these multiple possible future success situations are few and far between in this particular task, as less than half of Shakespeare’s plays have start strings with more than a 2-letter beginning sequence identical to that of another play- however given the exponential nature of the improvement of the monkey's chances, it is still nice to know that should he type the following strings at the beginning, there will be more than one possible starting work he could be in the process of completing, and hence the probability of him being successful is greater than if he stumbled across a string of the same length that began another title.

Start String No. of Works still
possible (NoW)
p(monkey's success for remaining letters)
Henry 7 1/(264,999,995/7* 42!)
Henry V 5 1/(264,999,994/5* 42!)
Henry VI 4 1/(264,999,993/4* 42!)
Henry VI Part I 3 1/(264,999,988/3* 42!)
Henry VI Part II 2 1/(264,999,987/2* 42!)
Henry VII 2 1/(264,999,992/2* 42!)
Richard 2 1/(264,999,993/2* 42!)
Love 2 1/(264,999,996/2* 42!)
King 2 1/(264,999,994/2* 42!)
Ti 2 1/(264,999,998/2* 42!)
Me 2 1/(264,999,998/2* 42!)

While not doing the full maths here, while the factoring out the factorial denominator here (NoW* 42! where NoW=number of works) from the overall permutational space will increase the probability of success by several orders in the sequence continuation space, this only linearly increases with NoW, the number of works that share the string in question. So, if a given string starts a greater number of number works than another (e.g. "Henry" (7) vs "Richard" (2)), the former sequence will not have increased possibility of success by an exponent- reducing by 7 x 42! rather than 2 x 42! will not reduce the number of final possibilities exponentially, while the extra successful 2 letters in "Richard" will do.

Sharing of start strings only boosts probabilities when comparing start strings of the same length, say "Henr" will be a better sequence than "Love" in the above table. So, given it is length of the current successful string that is the principal determiner of better future chances of successful completion, the monkey's best current chance of future success possible in those shown above will be after typing "Henry VI Part II"- it's the start string of 13 characters that covers the biggest number of possible future successes. In terms of the overall task, in the combinatorial works setting where any order is permitted but with no repeated works, the overall probability of generating the complete works will be the same i.e. 1/(265,000,000/43!),regardless of the order of works he generates, but what is interesting is how the division of the sequence into subsequences and allowing interchangeability (combination) reduces the complexity of the task from the set-order (permutational) setting and increases the probability of it getting done faster- perhaps suggesting the need for things that unitize information such as words and syntactic phrases for human beings.

What's also interesting is the agony we experience as a bystander watching our Simian stoic friend tap away: we could watch him reach the end of a long valid start sequence and then see him fail instantly. Even if he reached the end of an entire work successfully, having watched his probabilities incrementally increase with each correct character by an exponential amount, his subsequent typing of a letter that doesn't begin a remaining Shakespeare play shatters his chances by resetting the denominator of the probability back to its old exhaustive self. Plotting the probability of success change over time could be very interesting (its time axis would be rather long and the data fairly sparse!). Whether it could be done in infinite time remains the subject of debate of minds greater than my own. I just found thinking about this interesting and a good way of sharpening my maths (forgive incorrect numbers/calculations!).

As an end note, while it changes the nature of the problem slightly, an interesting idea in complexity research put to me by lecturers at my university is whether you could reduce the time if you gave the monkey a basic programmable computer with a simple text terminal and option to start writing in assembly code (or even binary), rather than a typewriter. The chances of stumbling upon a program that was, say, able to permute any substring written so far with any other one, greatly increasing generative capacity and chance of success with each consequent key stroke, might in fact have a much smaller text length; it may be possible to write the program then leave it running and add to it again later after some interim meaningless characters, rather than requiring the exhaustive sequence of characters we require here. The monkey might even stumble across a program that can generate English words and sentences grammatically before it comes across the entirety of 'Hamlet'. The level of language permitted would be fairly important for the project; Python and Java may give the monkey a better chance than machine code or binary. Perhaps this might have corollary with poverty of the stimulus arguments in linguistics- how much innate structure for information processing can we give a child so that a grammar may be 'found', even in a stochastic process?

No comments: