GL Short Story

The first thing I tried for this project was using the RegexpParser to chunk select grammatical structures (VB TO VB in particular). The example I found used the tagged sentences from the Brown corpus, which preserve the sentence structure even though it is the individual words that are POS-tagged. This makes it easier to parse longer grammatical structures like VB TO VB in context. I spent quite some time trying to tag my own text in a similar form, but I was unable to maintain that [(sentence with tags)] structure.
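A minimal sketch of the nested structure described above, and of matching VB TO VB inside it. It does not use RegexpParser; it is a plain-Python stand-in, and the sample sentences, their tags, and the helper name `find_verb_to_verb` are all assumptions made for illustration.

```python
# Hand-tagged sample in the nested shape described above:
# a list of sentences, each a list of (word, tag) pairs.
tagged_sents = [
    [("I", "PRP"), ("want", "VB"), ("to", "TO"), ("sleep", "VB"), (".", ".")],
    [("He", "PRP"), ("tried", "VBD"), ("to", "TO"), ("get", "VB"), ("up", "RP")],
    [("The", "DT"), ("room", "NN"), ("was", "VBD"), ("quiet", "JJ")],
]

def find_verb_to_verb(sentence):
    """Return every (verb, 'to', verb) word triple in one tagged sentence."""
    matches = []
    for i in range(len(sentence) - 2):
        (w1, t1), (w2, t2), (w3, t3) = sentence[i:i + 3]
        # Any verb tag (VB, VBD, ...) followed by TO and another verb tag.
        if t1.startswith("VB") and t2 == "TO" and t3.startswith("VB"):
            matches.append((w1, w2, w3))
    return matches

# Search sentence by sentence, so a match never spans a sentence boundary.
chunks = [m for sent in tagged_sents for m in find_verb_to_verb(sent)]
print(chunks)
```

Keeping one inner list per sentence is what lets the search stay inside sentence boundaries; a flat list of tagged words would lose that information.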

So instead I used a CFG to generate some sentences from Kafka’s Metamorphosis in a way similar to what we had done in class (generative-grammar.py), but I wanted to get rid of all the lists/dictionaries. So I used template sentences instead, like the ones we used for our Mad Libs sketch. I wanted to see what could be generated if these sentences were put together.

I used four template sentences: (adj, noun, verb to verb), (det, noun, verb, adj), (prep, det, adj, noun), and (adj, adj, noun, verb, adj), and joined them together to produce a text. Unfortunately, the final product made no sense at all. Clearly my template sentences need more work.

The intention was to spend some more time splitting the original text into sentences, POS-tagging those sentences, and storing the resulting tag sequences as n-grams so that the sentence structures themselves could be markovified. The n-gram sentences could then be used to produce a text that hopefully would have made more sense. (Maybe this can be the workflow for the final?)
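The workflow sketched above could look roughly like this. It is only an illustration: the tag sequences are hand-written stand-ins for POS-tagged sentences, and the function names and the `<s>`/`</s>` padding markers are assumptions.

```python
import random
from collections import defaultdict

# Hand-written tag sequences standing in for POS-tagged sentences.
tag_sentences = [
    ["DT", "JJ", "NN", "VBD", "JJ"],
    ["DT", "NN", "VBD", "TO", "VB"],
    ["IN", "DT", "JJ", "NN"],
]

def build_tag_model(sentences, order=2):
    """Map each tuple of `order` tags to the list of tags that followed it."""
    model = defaultdict(list)
    for tags in sentences:
        padded = ["<s>"] * order + tags + ["</s>"]
        for i in range(len(padded) - order):
            model[tuple(padded[i:i + order])].append(padded[i + order])
    return model

def generate_tags(model, order=2, rng=random, max_len=20):
    """Walk the chain from the start state until the end marker is drawn."""
    state = ("<s>",) * order
    out = []
    while len(out) < max_len:
        nxt = rng.choice(model[state])
        if nxt == "</s>":
            break
        out.append(nxt)
        state = state[1:] + (nxt,)
    return out

model = build_tag_model(tag_sentences)
print(generate_tags(model))
```

Each generated tag sequence can then serve as a template to be filled with words of the matching part of speech.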

One of the main issues I had (aside from the fact that the text makes no sense) was being able to generate these different sentences from different templates inside a loop. The position of all the indents, fors and ifs is very important, and because the template sentences can be replaced by the n-gram sentences, I intend to keep working on that code.

Short Story Project

Individual Work By Harrison 

Project: Short Story

Brief Description: A 4-page story that combines Frankenstein and Jane Eyre

In the initial stage, I wanted to generate anthropomorphic stories by mixing the animal descriptions on Wikipedia with some novels rich in humanity. After experimenting a little with these two source texts, I found the results a little boring, because the Wikipedia descriptions are quite plain while the novels are more vivid. The mixture struck me as incoherent instead of fantastic, as I had expected. So I decided to keep the main idea of combining the styles of two texts. As I have recently been studying Frankenstein, I thought, “hmm, Frankenstein is also a non-human creature.” So, to make an interesting story, I focused on those elegant novels to distill details.

Short Story

In this project, I mainly played around with CFG. I built my own sentence structures and created my short story based on them.

Before we learnt CFG, I didn’t have a clear idea of what I should do. I used the Mad Libs technique to create the sentence structure, and then replaced the structure with words from the original text. It didn’t work well.

Later, one project in NaNoGenMo 2016 gave me an idea. The project is Dark I, a novel that consists of sentences of different lengths. It begins with sentences of 2 words, followed by sentences of 3 words, 4 words …


Why not do it by myself?

Instead of using “I” as the main character, I used Cinderella, because I think a “nonsense” novel with a character that people are familiar with could be more interesting.

I built different sentence structures. You can see them in the picture above.

In this project, I really wanted to build complex sentences (two simple sentences connected by a conjunction) and attributive clauses. However, I don’t know how to add a comma “,” between two different sentences. Moreover, since the sentence structure is so complicated, there are very many possible outcomes, so the program took a long time to run. (Actually, the output seemed endless.) To save time, I let the program generate only the first 1,000,000 results. That’s why all the sentences in the last chapter start with “cinderella only became”.

There are other things I am not satisfied with. First, the first word of each sentence is not capitalized. Moreover, in the code I asked for sentences with a depth of 6, but it gave me sentences with 8 words. If I have more time, I will fix these problems, especially the first one.
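One possible explanation for the depth puzzle, assuming the generator's depth setting limits how deep the derivation tree may go rather than how many words come out: a shallow tree can still be wide. The sketch below is a small pure-Python generator written for illustration (not the code used in the project); the toy grammar and the names `expand` and `to_sentence` are assumptions. It also shows one way to capitalize the first word.

```python
from itertools import product

# Toy grammar (hypothetical). Strings that are not keys are terminal words.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "Adj", "N"]],
    "VP": [["V", "NP"]],
    "Det": [["the"]],
    "Adj": [["kind"]],
    "N": [["cinderella"], ["prince"]],
    "V": [["saw"]],
}

def expand(symbol, depth):
    """Yield every word list derivable from `symbol` within `depth` levels."""
    if depth <= 0:
        return
    if symbol not in GRAMMAR:  # terminal word
        yield [symbol]
        return
    for rhs in GRAMMAR[symbol]:
        # Expand each right-hand-side symbol one level deeper, then combine.
        parts = [list(expand(s, depth - 1)) for s in rhs]
        for combo in product(*parts):
            yield [w for part in combo for w in part]

def to_sentence(words):
    """Join words, capitalize the first letter, and add a full stop."""
    text = " ".join(words)
    return text[0].upper() + text[1:] + "."

# A depth limit of 5 still produces 7-word sentences with this grammar.
sentences = [to_sentence(w) for w in expand("S", depth=5)]
for s in sentences:
    print(s)
```

With this grammar a depth of 4 yields nothing at all, because the object noun phrase cannot reach its words in time, while a depth of 5 yields sentences of seven words each.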

Now I find generating language more interesting than I expected. Look forward to future projects!

Week 9: NaNoGenMo Project

This project was a little difficult to be creative with, just because I felt limited by what I understood of the tools and the grammar. My original idea was to continue with my midterm project, this time using NLTK to make it cleaner. It started to work, then began throwing key errors again, so I decided to try a similar approach but in a different order. I ended up first switching all the words in the short story I found over to words from Shakespeare’s works. I sometimes had problems with getting full words, though; I tried to catch them, but it didn’t really work. I think I am not entirely sure how to use the line I return in the function. I then used Markovify to mix up the text. I had to use an n-gram of two, because the grammar isn’t quite right for Shakespeare’s writing and the text is so random that it kept returning None with a higher n-gram. When I switched the words out, I kept them in a dictionary to try to make the story more cohesive. That’s why the first line repeats itself. Then I wanted to see what happened when I tried summarizing it, to see if it made any more or less sense. I would have liked to handle the grammar errors and random characters better, but I think overall the story seemed okay. I tried to add a moral to the story at the end, and it was fun to see if it made any sense. The beginning is a little weird because it says the title over and over again. But I guess it would be interesting to put different texts through the program to see what happens, and since the output is limited by the word diversity of the first text, it would be interesting to play around with that.
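A minimal sketch of the dictionary idea mentioned above: each distinct word in the story is swapped for one word from a second vocabulary, and the choice is remembered so the same word is always replaced the same way. The function name, the sample story, and the sample vocabulary are assumptions for illustration only.

```python
import random
import re

def substitute_consistently(text, vocabulary, rng=random):
    """Replace each distinct word with one vocabulary word, reusing earlier choices."""
    mapping = {}

    def swap(match):
        word = match.group(0).lower()
        if word not in mapping:
            # First time this word is seen: pick and remember a replacement.
            mapping[word] = rng.choice(vocabulary)
        return mapping[word]

    return re.sub(r"[A-Za-z']+", swap, text), mapping

story = "the fox and the hound"
vocab = ["thou", "wherefore", "prithee", "forsooth", "anon"]  # hypothetical sample
result, mapping = substitute_consistently(story, vocab)
print(result)
```

Because repeated words always map to the same replacement, a repeated title or refrain in the source stays repeated in the output.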


Week 10: Short Story Project


I really like the text Fear and Trembling and have been wanting to do something relating to it for a while. There are two main themes in the book, “faith” and “resignation,” so I thought that would be a good way to focus this generated story. At first I wanted to section the generated text into three sections because that’s how I remember the book being divided, perhaps centered around three different characters, but when I was checking through the parts of speech and looking at the  text again, I realised the first section of the novel had Kierkegaard retelling the beginning of the story of Abraham in four different ways. This already had the qualities of a generated novel, so I thought it would be interesting to use what Kierkegaard had already written as the basis for the short story.


I ran a concordance on the whole text for two words, “faith” and “resignation.” The concordance function only prints its results (I later learned that I could adjust the number of characters and lines of text displayed), so I ran it separately, first on “faith,” then on “resignation,” saving each concordance in a separate file. After saving these files, I created another file with only the first section of text, the four separate retellings of the story of Abraham.

Because Kierkegaard uses the story of Abraham to explore the concept of faith and its opposite, resignation, I thought I’d create a text similar to the Augmented Heart of Darkness, an entry for the 2016 NaNoGenMo. Instead of defining words not found in the Basic English Dictionary, I would use the phrases from the separate concordances to describe Abraham.

While looping through the entire text for “Abraham,” I inserted a random phrase related to “faith” as a dependent clause after each “Abraham.” Because the concordance selected phrases based on the number of characters in the phrase, the first and last ‘words’ in the phrases were often only pieces of words. I split the phrases into lists of words and deleted the first and last word in each list before putting them back together and inserting them into the Abraham text, set off by commas. Then the words in the Abraham text were joined together into one line. This stitched-together text was put into an array. When checking the output, it seemed that the program kept generating multiple copies of this text, so I used a counter to make sure it would only generate one copy. I repeated the process for “resignation.”
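The trimming and insertion steps described above can be sketched as follows. This is an illustration under assumptions, not the project's code: the function names are invented, and the sample snippet is made up to imitate a fixed-width concordance line whose first and last tokens are cut off.

```python
import random
import re

def trim_snippet(snippet):
    """Drop the first and last token, which a fixed-width window may have cut."""
    words = snippet.split()
    return " ".join(words[1:-1])

def annotate(text, snippets, name="Abraham", rng=random):
    """Insert one trimmed snippet, set off by commas, after each `name`."""
    cleaned = [trim_snippet(s) for s in snippets if len(s.split()) > 2]

    def add_clause(match):
        return f"{match.group(0)}, {rng.choice(cleaned)},"

    return re.sub(rf"\b{re.escape(name)}\b", add_clause, text)

# Hypothetical concordance line: "ith" and "an" are fragments of longer words.
snippets = ["ith who had faith in the promise an"]
print(annotate("Abraham rose early.", snippets))
```

Doing the substitution in a single pass over the text with `re.sub` produces exactly one annotated copy, which avoids the need for a counter.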


The concept “faith” appears in the text almost 5X as many times as the concept “resignation,” and I wish that this iteration of the project represented that difference. There are four versions of the story of Abraham; I would have liked to use the two longer ones to include the text about “faith” and the two shorter ones as the frame for “resignation.” In addition, following Roopa’s advice, I wish I could have combined the texts in more meaningful ways. I had wanted to create three generations of this text: one of “faith,” one of “resignation,” and one of both, most likely a phrase from each of the concordances linked by a conjunction, also pulled from the main text, Fear and Trembling. I would also have liked to create grammars with text pulled from Fear and Trembling. I’m not sure how exactly I would have implemented the grammars; perhaps I could have experimented with adding structures to the beginnings and ends of the phrases pulled from the concordance. However, I underestimated how long it would take me to do the initial part of the project.


For this project, I had the least help from fellows, etc., so I don’t think I realised the project in the way that I wanted to, and there’s definitely room for improvement, especially since I really only used one technique from the NLTK Unit. It would be interesting to use more of the techniques from this unit to really drive home the dichotomy between the themes in Fear and Trembling. I struggled with trying to figure out how to use these tools in a creative manner, and I feel like I finally have a clearer idea of where I could go to improve the project, but it would still require some more brainstorming. Nonetheless, I’m proud of the progress I made with this project and enjoyed working on it. I asked someone for advice to complete this project, but their Python knowledge wasn’t much higher than my own, and it was rewarding to struggle to finally get this final version of the text.

NLTK short story, a throwback to high school’s bumbling essays

For my project, I decided to experiment and see what would happen if I replaced all of the words in a text with synonyms of the words. This idea came to me as kind of a throwback to writing essays in high school, when I would look at the thesaurus for every other sentence in order to sound more sophisticated.

In order to condense everything down to the requisite 3-5 pages, I decided to use the summarizer function that we wrote in class. However, to make sure that it retained the sense of progression as a story, I summarize every chapter and combine the chapter summaries together into a single string. Next, I run this string through the synonymousify function, which goes through the synsets for everything that is not a stop word and replaces them. After everything has been replaced with synonyms, I run the summarize function on it again, this time outputting a summary that is 3 and a half pages long.

Originally, I was working with The Golden Compass, which I had used for my previous short story project. However, replacing the words in that novel wasn’t very fruitful because, as a sort of fantasy novel, it has a lot of words that are made up and therefore have no synonyms. Now I’m using Jane Eyre, and I think the text suits the project quite well. This book uses a lot of really long, complicated sentences barely held together with a flurry of commas and em dashes. I think that kind of sentence structure complements the style of a high school student trying way too hard to sound smart by replacing all the commonly used words with bigger words that mean mostly the same thing. Additionally, Jane Eyre is definitely one of those books that a student would be asked to write an essay on in a high school lit class.

One thing I tried to do, but was unsuccessful with, was working with stems. I wanted to use the Lancaster stemmer because of its relatively more general output (organization –> org, not organ), and replace words with the same stem with each other. I had managed to compile a dictionary of stems and all the words in the text that share each stem. A lot of this dictionary was really boring, with a lot of the stems being the entire word and/or only having one word with that stem or variations of the same word (which is how stems are actually supposed to work). However, some entries in the dictionary were really interesting! For example, ‘oc’ was the stem for both ‘ocean’ and ‘ocultists’ and ‘rev’ encompassed ‘revere’, ‘revive’, and ‘revisit’. Unfortunately, I ran into problems when it came to the actual replacement part of this idea, so it didn’t end up in my final project, but it is something I’d like to expand on in the future.
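A rough sketch of the stem dictionary and of one way the replacement step might work. It is not the project's code: `toy_stem` is a hypothetical stand-in that simply keeps the first three letters (a real stemmer such as Lancaster would be passed in instead), and the other names are assumptions too.

```python
import random
from collections import defaultdict

def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: keep the first three letters."""
    return word.lower()[:3]

def group_by_stem(words, stem=toy_stem):
    """Build a dictionary mapping each stem to the set of words sharing it."""
    groups = defaultdict(set)
    for w in words:
        groups[stem(w)].add(w.lower())
    return groups

def swap_within_stem(words, groups, stem=toy_stem, rng=random):
    """Replace each word with a different word of the same stem, if one exists."""
    out = []
    for w in words:
        others = sorted(groups[stem(w)] - {w.lower()})
        out.append(rng.choice(others) if others else w)
    return out

words = ["revere", "revive", "revisit", "ocean"]
groups = group_by_stem(words)
print(swap_within_stem(words, groups))
```

Words whose stem group contains only themselves are left unchanged, which covers the many "boring" entries in such a dictionary.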

Week9: Short Story: Neon Emma from Grandiloquent Dictionary


This is a piece of generated literature whose source text is the first 250 sentences of the novel Emma by Jane Austen. The program randomly picks a word from each sentence and substitutes it with another, rarely used word. This reduces the readability, but adds a sense of humor and jest to this famous novel and its characters.


I was to a large extent inspired by a NaNoGenMo project called Twide and Twejudice. That program replaces words from Pride and Prejudice with words that are popular on Twitter. The result is that the characters of such a classic novel seem to speak in Twitter language, which reads as a joke.

I happen to have a copy of an unusual dictionary, the Grandiloquent Dictionary. The book is filled with the least commonly used English words, which are long, hard to pronounce and have very specific definitions. To give you a sense: “abecedarian: a person who is learning the alphabet”. Instead of Twitter language, I wondered what the product of mixing grandiloquent words with normal ones would be, and how readers would respond to it.


When looking up a word to substitute, the replacement satisfies at least one of the following conditions: 1. Similar spelling: in most cases the two words share the same one or two initial characters. 2. Hierarchical synonymy: e.g. “politician” and “policeman” share more similarities than “politician” and “dog”, because the common root of the former pair is probably “person”, while that of the latter is “mammal”, which is a broader concept.



Short Story Documentation

For my short story assignment I created a Python class called TextTools. This programme implements several functions that segment the text in a predefined way, format it, and generate a new text based on the input.

First of all, the programme segments the text by sentences and then segments the sentences by stopwords, which are provided by nltk.corpus. This is done by the function self.to_phrases(), which generates a list, self.all_phrases, that groups the words of each sentence into a list if they are not stopwords and into a single string if there are several consecutive stopwords in the text.
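The grouping step described above can be sketched with `itertools.groupby`. This is an assumption-laden illustration rather than the TextTools code: the stopword set is a tiny hand-written stand-in for the nltk list, and this `to_phrases` is a free function written for the example.

```python
from itertools import groupby

# Tiny hypothetical stand-in for the nltk stopword list.
STOPWORDS = {"the", "of", "and", "in", "a", "to", "was"}

def to_phrases(sentence):
    """Split a sentence into runs of stopwords (strings) and content words (lists)."""
    words = sentence.lower().split()
    phrases = []
    for is_stop, run in groupby(words, key=lambda w: w in STOPWORDS):
        run = list(run)
        # Consecutive stopwords become one string; content words stay a list.
        phrases.append(" ".join(run) if is_stop else run)
    return phrases

print(to_phrases("The old man was in the garden"))
```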

Secondly, the output of the to_phrases function is sent to self.tokenize_phrases(), which essentially loops through the segmented text and tags the words that are not stopphrases/stopwords using nltk’s built-in part-of-speech tagger.

Next, the function self.markovify_patterns() creates two dictionaries: self.chainPossibilities, which is basically a dictionary with a key for every single POS-tokenized phrase in the text, and self.after_stopword, which has the stopwords and stopphrases that occur in the text as keys and the list of phrases that follow each of them (which can themselves be stopphrases) as values. These two dictionaries are what the next function relies on.

The next function, self.generate_mask(), implements a Markov chain and constructs a list of phrases and stopphrases, using the dictionaries of possible next values that we created in the previous steps. The ‘mask’ of a sentence looks like this:
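As a rough illustration of how such transition dictionaries can drive a chain, the sketch below merges the two dictionaries into a single one for brevity, which is a simplification and not the structure described above. The sample sequences, the representation (stopphrases as strings, content phrases as tuples of POS tags), and the function names are all assumptions.

```python
import random
from collections import defaultdict

# Hand-made sample: stopphrases are strings, content phrases are tuples of POS tags.
sequences = [
    ["the", ("JJ", "NN"), "was in the", ("NN",)],
    ["a", ("NN",), "of the", ("JJ", "NN")],
]

def build_transitions(seqs):
    """Map every element to the list of elements that followed it."""
    nxt = defaultdict(list)
    for seq in seqs:
        for cur, following in zip(seq, seq[1:]):
            nxt[cur].append(following)
    return nxt

def generate_mask(nxt, start, rng=random, max_len=8):
    """Random-walk the transitions from `start`, stopping at a dead end or max_len."""
    mask = [start]
    while len(mask) < max_len and nxt.get(mask[-1]):
        mask.append(rng.choice(nxt[mask[-1]]))
    return mask

transitions = build_transitions(sequences)
print(generate_mask(transitions, "the"))
```

The `max_len` cap matters here because transitions learned from different sentences can form a loop.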

After the previous step, all we have left to do is to translate the part-of-speech patterns from the sentence mask into actual words, for which we use the dictionary self.grammar created in the function self.to_grammar(). This dictionary maps every occurring phrase structure to the phrases that satisfy it in the input text. The last step in the process of text generation is the correct formatting of the output string: capitalizing the first letters after punctuation and putting in line breaks.
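The final formatting step might be sketched like this. The function name, the choice of `.`, `!` and `?` as sentence-ending punctuation, and the `sentences_per_line` parameter are assumptions for illustration.

```python
import re

def format_output(text, sentences_per_line=2):
    """Capitalize sentence starts and break the text into lines."""
    # Collapse runs of whitespace left over from joining phrases.
    text = re.sub(r"\s+", " ", text).strip()
    # Capitalize the first letter of the text and any letter after ., ! or ?
    text = re.sub(
        r"(^|[.!?]\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )
    # Split after sentence-ending punctuation and regroup into lines.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    lines = [
        " ".join(sentences[i:i + sentences_per_line])
        for i in range(0, len(sentences), sentences_per_line)
    ]
    return "\n".join(lines)

print(format_output("the dog ran.  it was late. he slept."))
```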

As I was trying different corpora to train the model on, I realized that using a corpus of fewer than 10-15 thousand words limits the variability of the text, which becomes very repetitive and obviously poorly chained. Because of this, I tried using chunks from several books of my choice, including ‘It’ by Stephen King, The Stupidest Angel by Christopher Moore and the Anarchist Cookbook by William Powell, as well as a combination of chunks from the Anarchist Cookbook and The Stupidest Angel. From my experience, the output seems more coherent when there are not many characters involved, when not many dialogues occur, and especially when the text is more of a list of actions than prose.

Another thing that I noticed concerns punctuation and special characters: from the very start I intended to remove everything except the full stops. However, due to some issue in the code, the programme did not completely remove the brackets, commas, quotation marks etc. At first I thought that this was a major issue, but after reviewing several outputs I noticed that the special characters are mostly consistent (brackets open and close, quotation marks are mostly correct), apart from a few mistakes (I suspect these occur when a phrase in quotation marks or brackets has a stopword in it). Despite these issues, I feel like the punctuation makes the output feel more real.



Short Story Project’s Documentation

Before beginning this project I knew I would not be able to generate a fully cohesive short story, so I chose Gutenberg’s Bible to make things a little more interesting. There are websites like sandersweb and bibledice that randomly choose a verse from the Bible and display it, but I wanted to go a step further and randomly generate new verses (and passages) from the Bible using a context-free grammar. As Gutenberg’s version of the Bible contains a lot of obsolete English words, I wanted to see whether readers would simply attribute the obscurity of my generated story to the usage of these words, or whether they would be able to tell that the text was in fact generated by a computer.

Given the size of the data I was processing, I decided to use pickle and split my code into 3 separate files. In the first file, I just sort words from the Bible and add them to a dictionary according to the parts of speech they belong to. This dictionary is then saved to an external text file. The second file loads the dictionary from this text file and extracts nouns, verbs, adjectives, determiners and prepositions from it. Also, regex is used to identify non-word strings that were mistakenly tagged by nltk as the aforementioned parts of speech. 160 grammar structures (with randomly chosen words) are constructed using these parts of speech for each of the following patterns:

These grammar structures are all saved to an external file, and the final file of my code uses this list of structures to generate sentences. Four sentences are generated from each structure, but only the fourth ones are displayed, as the determiners and nouns in their verb phrases are different from those in their noun phrases. These sentences are then changed to lowercase with only their first letters capitalized. Whitespace characters are randomly added to the list of sentences to be displayed, and then the elements are shuffled and printed on the screen as paragraphs. During this entire process, I used the nltk, pickle, random and re libraries.

The output did meet my expectations to a great extent: it does contain incomprehensible sentences, but on the whole these sentences do seem to belong to the same story, courtesy of the obsolete and archaic English words that make up this version of the Bible. It is not exactly a ‘typical’ story; however, it is encouraging that the output does not look too different from some of the NaNoGenMo 2016 submissions. Here is a small portion of the output:

During this process I learnt that some punctuation marks can also be mistakenly categorized as nouns, verbs or other parts of speech by nltk’s part-of-speech tagging. I spent hours trying to figure out why my program was displaying an error even though the same code was running fine on other, smaller text files. As it turned out, it was happening because of the punctuation marks present in my lists of nouns and verbs. Just to be safe, I used a regex pattern to check for unnecessary elements in the other lists as well, and this solved the issue. The following snippet of my code shows the adjustment I made:
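As an illustration only (this is not the author's snippet), a filter of this kind might look like the following sketch. The pattern, the names `WORD_RE` and `keep_words`, and the sample tokens are assumptions.

```python
import re

# Accept alphabetic tokens, optionally joined by internal hyphens or apostrophes.
WORD_RE = re.compile(r"^[A-Za-z]+(?:['-][A-Za-z]+)*$")

def keep_words(tokens):
    """Drop punctuation marks, verse numbers and other non-word strings."""
    return [t for t in tokens if WORD_RE.match(t)]

# Hypothetical sample of a noun list polluted by mistagged punctuation.
nouns = ["lord", ";", "--", "thou", "1:1", "mother-in-law", "'"]
print(keep_words(nouns))
```

Applying the same filter to every part-of-speech list before building grammar structures keeps stray punctuation out of the generated rules.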

Generative Language: Short Story Project

My aim for the short story project was to make use of some of the NLP techniques we have learnt so far in order to extract the structure of one text and then blend it with the vocabulary of another. After experimenting with several different texts, I settled on Jane Austen’s Emma for the structure and Herman Melville’s Moby Dick for the vocabulary, as I thought this would lead to particularly interesting (and entertaining) results.

At first, I decided to take a somewhat simplified approach, which was to replace the words in the first text directly with potential substitutes from the second text, based on a number of factors that determined “suitability”. This process could also have been modeled using a context-free grammar, but the results would have been the same, while the process would have been considerably more complicated. I tested a number of different techniques for substitutions, and my best effort was a combination of part-of-speech equivalence, Markov chains, and part-of-speech Markov chains. Unfortunately, the results were still far below my initial expectations.

I identified one of the most annoying sources of error to be the way that NLTK does its POS-tagging. While rather accurate in general, its results tend not to be accurate enough for my needs, as certain words that can be multiple different parts of speech would best be ignored when trying to randomly place them in a predefined structure. Thus, I made an interactive POS-tagger which allowed me to manually decide which tags were correct (or, to be precise, which words I want to keep in my vocabulary). This made a considerable improvement, especially as I settled on using only nouns and adjectives.

While the resulting text was now quite interesting and made a good amount of sense, I thought I could do a bit more with it. My next step was to select from Austen’s Emma all of the sentences containing Emma’s name. I then came up with a simple pattern to replace all of the occurrences of her name (and pronouns referring to her) with references to Moby Dick (which was my initial goal). Still, to make it more interesting, I wanted to go another step further, which was to transform this resulting text into poetry.

I tokenized all of my selected sentences into phrases based on punctuation, and then used the “pronouncing” module to classify each phrase based on its number of syllables and its rhyme. I then grouped all of these phrases in ascending order of syllables, and in pairs, resulting in my final (poetic) short story. The nature of the text (poetry vs prose) added an extra aspect of mystery to the story, which matched the already curious fact that a whale is being described as having very interesting human endeavors.
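A small sketch of the ordering and pairing step. It does not use the “pronouncing” module: `rough_syllables` is a crude vowel-group heuristic standing in for a real syllable count, rhyme classification is left out because it needs a pronunciation dictionary, and the function names and sample phrases are assumptions.

```python
import re

def rough_syllables(phrase):
    """Very rough estimate: count groups of consecutive vowels in each word."""
    words = re.findall(r"[a-z']+", phrase.lower())
    return sum(max(1, len(re.findall(r"[aeiouy]+", w))) for w in words)

def pair_by_length(phrases):
    """Sort phrases by estimated syllable count, then group them two by two."""
    ordered = sorted(phrases, key=rough_syllables)
    return [ordered[i:i + 2] for i in range(0, len(ordered), 2)]

# Hypothetical sample phrases.
phrases = [
    "the whale was very fond of dancing",
    "she smiled",
    "moby dick was happy",
    "the sea",
]
for couplet in pair_by_length(phrases):
    print(" / ".join(couplet))
```

The heuristic miscounts words with silent vowels, so it is only good enough to show the ordering idea.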

In the end, I was quite content with the result, and decided to stick with the rhymed version as my short story (since it is, nevertheless, a story).