07.03.2016 ― Plot analysis from the Matrix

In memory of an awkward answer

A memory

[A silent classroom. The pupils are studying hard, or at least appearing busy, as mathematical equations are being solved. The atmosphere is solid. A wild question appears and demolishes the muted state of the room.]

Teacher: What do you want to be when you grow up?

[A genuinely surprised student, singled out by the teacher's question and in the spotlight of the classroom, raises their head from the algebra exercises. Without hesitation the answer is given by the FIFO method: first idea in mind, first one out.]

Student-of-few-words: Mathematician.
[Yes, it was a lie – but it was for self-defence.]

Teacher: Hah! Do you even know what a mathematician does for a living?
[So it was a white lie. A bad one.]

Student-of-few-words: Nope.
[No, and this time it wasn't a lie.]

[The silent class has stopped working and stares at the teacher and this student. The teacher shifts his position on the chair, sighs and starts a career choice speech.]

Teacher: A mathematician is a person who first takes a book. Then, without reading the book at all, counts all the periods and commas in it, and in the end is able to tell the story plot of the book from those.

[The student is confused, thinking about the huge amount of periods and commas and how they could tell a full story. Troublesome, but interesting.]

Teacher: Now, do you still want to be a mathematician?
Student-of-few-words: ...I'll consider it.

[The students continue their exercises, at a pace good enough to pass their levels, but with less purpose, to avoid the unattractive destiny of ending up a mathematician.]


What is a story plot in a movie made of? It's a collection of dialogue, narrative description, characters and their whereabouts in the settings and scenes. It's silent moments between dialogues, explosive scenes with screams and shouts, sad farewells and laughing out loud. In total, it's the product of actions taken and words spoken, a result of all the interactions and events.

Being unable to end up in the shoes of a mathematician, the student decided to check what a plot might look like by the definition given earlier by a smart teacher. Here is a story plot analysis of the movie The Matrix by the Wachowskis (1999). All the number crunching is done in R.


To start with, any sort of analysis needs data. This time I'll be using only the dialogue lines from the movie. Although dialogue doesn't represent all the content of the movie, it should contain at least some essential parts and so be a reasonable base for the analysis. Many things might be done rather than spoken, but the dialogue might still hold the beat of the plot flow.

If one compares a book and a movie, the biggest difference (after the visual content) is that the descriptive parts of a book are portrayed as actions on the screen. What should stay the same regardless of the medium is the dialogue, the unequivocal component shared by both. The descriptive parts of a movie can be written down in many ways, but the dialogue can only be captured as it is, and is therefore easier to process analytically. So one might assume that less information is lost when the dialogue of a movie is converted into data.

A line is made of one or more sentences. A sentence is made of one or more clauses. But how is an interaction between characters created? The consequence of a line, and the future of the dialogue, is decided at the end of the speech turn: how does the line or utterance end? Specifically, with which mark? This is the culmination point where a character throws the ball of the story plot to the next one. And the next person should (if obeying good conversation manners) react to the ending mark as expected. Or react in some way at least. Otherwise we have multiple monologues fighting against each other instead of a purposeful dialogue. Of course, if something exceptional happens, the dialogue might not continue as expected.


In this movie data there are a few choices for the ending mark of a line:

  • a period [ . ]
  • an exclamation point [ ! ]
  • a question mark [ ? ]
  • a dash [ - ]
  • and a collection of outliers: one greater-than sign [ > ], one zero [ 0 ], and three apostrophes [ ' ]
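
The classification in the list above can be sketched in a few lines. This is only an illustration in Python (the original analysis was done in R and its code is not shown); the names `ENDING_LABELS`, `ending_mark` and `classify_ending` are made up for this sketch, and it assumes that the ending mark is simply the last non-whitespace character of a line.

```python
from typing import Optional

# Labels for the four main ending marks named in the list above (hypothetical names).
ENDING_LABELS = {".": "period", "!": "exclamation", "?": "question", "-": "dash"}


def ending_mark(line: str) -> Optional[str]:
    """Return the last non-whitespace character of a dialogue line, or None if empty."""
    stripped = line.rstrip()
    return stripped[-1] if stripped else None


def classify_ending(line: str) -> str:
    """Map a dialogue line to an ending label; any unlisted mark counts as an outlier."""
    mark = ending_mark(line)
    if mark is None:
        return "empty"
    return ENDING_LABELS.get(mark, "outlier")


examples = [
    "Now it's your turn.",
    "Now it's your turn!",
    "Now it's your turn?",
    "Now it's your turn-",
]
labels = [classify_ending(s) for s in examples]
print(labels)  # ['period', 'exclamation', 'question', 'dash']
```

A line ending in anything else, such as a digit or an apostrophe, falls into the outlier group.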

There is a certain amount of energy charged into each of these ending marks. One assumes that a period-ending sentence is the most neutral of them all. A question mark gives a vibe of uncertainty, confusion and an answer-seeking mentality. Such a sentence might be followed by an answering sentence, ending either with a period or, if answering with another question, with another question mark. A shout, sarcasm, irony or an emphasized message usually ends with an exclamation point. These sentences carry a lot of weight and can easily be used to change the direction of the dialogue. Last but not least, a dash in this data represents a sentence that is cut off in the middle by another character's intervention or a remarkable action. So it's obviously not the same thing if a character says:

  • Now it's your turn.
  • Now it's your turn!
  • Now it's your turn?
  • Now it's your turn- *AAAARRG*

As mentioned, exclamation marks carry emotional power, so they could portray the progress of an action movie. The question marks, in turn, might show the progress of the plot: unknown things are sought with questions, and the number of question marks might indicate some sort of learning curve of the characters. Or their level of confusion.


To stop pondering sentence-ending theories without proper background research, I applied some semi-random experimentation to the data and was still able to harness some results. In the data of 840 dialogue lines, the sentence endings are distributed as follows:

Table 1: sentence endings' distribution.

As expected, most of the movie is made of period-ending sentences, which cover 65% of all the lines. The next most common endings are the question mark (21%) and the exclamation point (11%). Concentrating on the popular period part of the data felt like a dull and too general approach, as it already covers the majority of the data. Instead I started to compare exclamation points, question marks and dashes, only to end up with... waves.
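
How such percentages come about can be shown with a minimal sketch. It is written in Python for illustration (the original was done in R), the function name `ending_shares` is hypothetical, and the ten labels are made-up toy data, not the movie data.

```python
from collections import Counter


def ending_shares(labels):
    """Return the percentage share of each ending label, most common first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.most_common()}


# Toy data only: ten invented lines, not the 840 lines of the movie.
toy_labels = ["period"] * 6 + ["question"] * 3 + ["exclamation"]
shares = ending_shares(toy_labels)
for label, share in shares.items():
    print(f"{label}: {share:.0f}%")
```

With the real data, the input would be one label per dialogue line, in the order the lines are spoken.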

The sentence endings seemed to be somewhat related to each other. For example, dashes and exclamation points tend to follow each other. Yet, as dashes are quite rare, the statistical results for them were not very trustworthy, and they are therefore left out of the results shown here. Concentrating instead on exclamation points and question marks, both of which cover a relevant share of the data, the outputs looked interesting. To remove the difference in the number of occurrences, I turned the data into cumulative distributions that show how the total amount of each sentence ending accumulates as the movie progresses. By comparing the shapes and trends of the curves, it was possible to notice some sort of co-operation between the ending marks. They were taking turns.
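
One way to build such a cumulative distribution is sketched here in Python (the original used R). The sketch assumes that each curve is divided by the total count of its own ending mark, so that every curve runs from 0 to 1 no matter how common the mark is; the name `cumulative_share` and the six-line toy sequence are invented for the example.

```python
def cumulative_share(labels, target):
    """For each dialogue line, the fraction of all `target` endings seen so far."""
    total = sum(1 for label in labels if label == target)
    if total == 0:
        return [0.0] * len(labels)
    seen = 0
    curve = []
    for label in labels:
        if label == target:
            seen += 1
        curve.append(seen / total)
    return curve


# Toy sequence of ending labels in spoken order (not the movie data).
toy = ["question", "period", "exclamation", "question", "exclamation", "period"]
q_curve = cumulative_share(toy, "question")
e_curve = cumulative_share(toy, "exclamation")
print(q_curve)  # [0.5, 0.5, 0.5, 1.0, 1.0, 1.0]
print(e_curve)  # [0.0, 0.0, 0.5, 0.5, 1.0, 1.0]
```

Because both curves end at 1, their shapes can be compared directly even though one mark appears about twice as often as the other.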

To see how the appearances happen, the cumulative distribution is shown in figure [1]. The grey trend line shows the middle course that the curves would follow if the occurrences were always evenly spaced. As the distribution progresses from left to right, I calculated trend coefficients from samples of 40 dialogue lines. The trend sample plot is shown on the right, in figure [2]. In the latter, the two series often seem to have opposite-phased waves, or at least opposite peaks. As exclamation marks tend to have gaps, their cumulative line has flat stretches every now and then, which occasionally leads to a trend of zero. It's possible to calculate a certain number of plot parts based on the peak differences and the wavelengths. It seemed that the plot could be divided into sectors, each given a mood classification based on the ending mark holding the higher peak. With deeper knowledge and analysis, maybe the entire plot could be seen from the dots and commas in the end.

Figures 1 & 2. A cumulative distribution of ending mark progress (left) and its trend waves (right).
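
The trend coefficients could be computed roughly as in the following sketch. The text does not say how they were obtained, so two things are assumed here: the samples are consecutive, non-overlapping windows, and the coefficient is the least-squares slope of the cumulative curve inside each window. The sketch is in Python (the original used R), and `slope` and `window_slopes` are hypothetical names.

```python
def slope(ys):
    """Least-squares slope of ys against the positions 0, 1, 2, ..."""
    n = len(ys)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den


def window_slopes(curve, window=40):
    """Slope of each consecutive, non-overlapping window of the curve (assumed setup)."""
    return [slope(curve[i:i + window]) for i in range(0, len(curve), window)]


# Toy cumulative curve: it rises for four lines, then stays flat for four lines.
toy_curve = [0, 1, 2, 3, 3, 3, 3, 3]
slopes = window_slopes(toy_curve, window=4)
print(slopes)  # [1.0, 0.0]
```

The flat second window gives a slope of zero, which is the same effect that gaps between exclamation marks have on the trend.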

Another observation I made from the figures is the rhythm of the plot. Starting with action, the story continues with multiple exclamation and question sections that take turns. Halfway through the movie there is a longer question round, followed by an exclamation peak that is higher than the earlier ones. The end of the movie is a drastically intense exclamation party. While questions seem to decrease towards the ending, more exclamation marks take place in the final climax. Just as a sentence gets its future from its ending mark, a movie gets its future from its ending: maybe giving an ending to remember makes a movie unforgettable?

Yet, even though the analysis was based on only two sentence endings, a united poster had to be created. All the sentence endings are placed into the same plot poster shown at the beginning of the page: the blurry and dreamy poster of the sentence endings [3]. The ending mark occurrences are presented in polar coordinates in a clockwise direction. Each ring presents one ending mark. Empty stretches in a ring represent the parts of the movie where that ending mark was not met at all. For some reason, it looks a bit like a blue-eyed face.
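
The polar layout can be sketched as a mapping from line position to an angle and from ending mark to a ring radius. This Python sketch is only a guess at the idea (the original poster was made in R): it assumes the movie starts at twelve o'clock and runs clockwise, and the ring radii and the name `polar_points` are invented for the example.

```python
import math


def polar_points(labels, rings):
    """Place each ending on its ring, clockwise from the top; returns (label, x, y)."""
    n = len(labels)
    points = []
    for i, label in enumerate(labels):
        if label not in rings:
            continue  # marks without a ring are not drawn
        theta = 2 * math.pi * i / n  # share of the movie as an angle
        r = rings[label]
        # sin for x and cos for y starts at the top and turns clockwise
        points.append((label, r * math.sin(theta), r * math.cos(theta)))
    return points


# Toy data: four lines and two rings with made-up radii.
toy = ["period", "question", "period", "question"]
pts = polar_points(toy, {"period": 1.0, "question": 2.0})
for label, x, y in pts:
    print(f"{label}: x={x:.2f}, y={y:.2f}")
```

Stretches of the movie where a mark never occurs simply produce no points on that ring, which is what leaves the empty gaps in the poster.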

End result

Questions run low in the end. One would assume that maybe Neo is less confused by the end of the movie.

Questions and exclamations tend to be opposites, as there are times to ask and other times to shout.
When a little less conversation, a little more action please. [4]