Grace_Response to What Is Generative Art? (Cici Liu class)

In his paper, Philip Galanter offers a working definition of generative art, and then shows his readers that generative art can be a pure art form existing outside of the computer, and hence is an independent art form. Different artists can utilize generative art for different reasons, from different aspects, and to make different visual effects.

 

Although some people assume that a complex system cannot be purely simple, or otherwise it would not be called a system, research has shown that when people see a system that is too complicated, they lose interest, just as they do when they see one that is too simple and straightforward. This means that a system organic and dynamic enough has to contain some degree of complexity, but at the same time demonstrate some pattern that people can discover, or some underlying rule that catches their attention.

I strongly agree with this opinion: there is no pure order or chaos in a complex system like a generative system, and it is the seemingly paradoxical relationship between its appearance and its underlying principle that makes people so fascinated by it.

This makes me think of an art piece in a permanent exhibition at M50 in Shanghai. An artist surnamed Zhang took photos of tens of thousands of people's figures and combined them into one giant photo in postproduction.

The art piece looks like this:

 

and the close up looks like this:

At first sight I didn't know what the picture was, except for the plaid pattern that was distinguishable from a distance. Only when I approached it and took a close look did I realize how amazing it was that the artist had edited so many photos together to make the huge one. It also has its meaning, at least to me: it indicates the massiveness of the world, how small an individual is in the crowd, yet how each of us constructs the world in our own unique way.

I also love the last part of the paper, and I’d love to quote it here:

“Generative artists can demonstrate by compelling example reasons to maintain faith in our ability to understand our world. The generative artist can remind us that the universe itself is a generative system. And through generative art we can gain our sense of place and participation in that universe.” – After all, it's not only about the visuals themselves; it concerns the human body.

Final Project Documentation: Generated Poems – MoMA Collection

See my proposal here.

Why

It seems to me that artwork captions are a double-edged sword for gallery staff and museum fans. A professional caption stresses the highlights of the artwork, whereas an over-grandiloquent one…

What

The generated poems are a remix of artwork captions from the MoMA website. MoMA kindly provides a glossary of art terms. Clicking a term leads to that term's homepage, which gives a short introduction (e.g., here is the link for Abstract Expressionism). Below it are some representative artworks associated with the term.

Given the rich source, I decided it was better to generate one poem per term, so that the outputs may differ in word choice and may reflect the concerns and interests of each topic. For one term, the source text is the sum of the introduction to the term and the captions of all its works.

How

  1. Save source text by scraping, following dynamic links.

1.1 Scrape the link for each term

1.2 On the homepage of a term, collect the introduction paragraph and save it to a local file

 

 

1.3 Scrape the link for each representative artwork

1.4 On the page of each artwork, collect the caption paragraph and save it to a local file.
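A minimal sketch of steps 1.1–1.4, assuming requests and BeautifulSoup; the URL pattern and selectors here are placeholders, since MoMA's actual markup differs:

```python
import requests
from bs4 import BeautifulSoup

BASE = "https://www.moma.org"

def scrape_term(term_url, out_path):
    # 1.2: collect the introduction paragraph on the term's homepage
    soup = BeautifulSoup(requests.get(term_url).text, "html.parser")
    with open(out_path, "w") as f:
        intro = soup.find("p")
        if intro:
            f.write(intro.get_text().strip() + "\n")
        # 1.3: follow the link of each representative artwork
        for a in soup.select("a[href*='/collection/works/']"):
            work_page = BeautifulSoup(requests.get(BASE + a["href"]).text,
                                      "html.parser")
            # 1.4: collect the caption paragraph and save it locally
            caption = work_page.find("p")
            if caption:
                f.write(caption.get_text().strip() + "\n")
```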

2. Analyze the source to find the most frequent words and all possible rhymes.
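Step 2 might look like the following sketch, assuming NLTK for word frequencies and the pronouncing library for rhymes (the original tooling isn't specified):

```python
import nltk
import pronouncing

def analyze(source_text, top_n=50):
    words = [w.lower() for w in nltk.word_tokenize(source_text) if w.isalpha()]
    vocab = set(words)
    # the most frequent words in the source
    frequent = nltk.FreqDist(words).most_common(top_n)
    # for each word, the rhyming words that actually occur in the source
    rhymes = {w: sorted(vocab & set(pronouncing.rhymes(w))) for w in vocab}
    return frequent, rhymes
```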

 

3. Mix

Samples

MindLit: Final Project Documentation

Ideation:

Nowadays, even with the word-count limits of Twitter, Weibo, or WeChat moments, I have noticed that many people tend to compose their tweets and moments in a literary way. Take my uncle: he reads books all the time, and the WeChat moments he composes sound very appealing.

I was once also a fan of doing so, but it turned out too time-consuming, and I lacked that accumulation of reading. So for my final project, I decided to make a tweet generator which turns a few simple input sentences about the user's mood and what they have done today into a well-written tweet. I guess this could fulfill their wish of writing like Hemingway, Fitzgerald, or Qiuyu Yu.

MindLit processes in the following order:
1. get user input and tokenize it
2. run through the source text and get the top 10 sentences with meanings most similar to the input
3. sequence the 10 sentences based on sentiment analysis
4. build the UI and post the result to Twitter

Process:

At the very beginning, I decided to first analyze the sentiment of the user input and match it with a source text. However, considering that the user input is basically a few simple sentences about daily routine, I think any sentence of the source text could possibly be a match. So instead of separating different source texts, I run through all the text within my source and tokenize it. To get the meaning of a sentence, I use a summarizer function, basically calculating word frequencies and judging their relevance to the user input. To make this work, I have to eliminate the interference of stopwords such as "I", "you", "the", "in", "on", etc.

In the first version of my program, I did not apply a lemmatized summarizer, which turned out to greatly reduce the effectiveness of finding a relevant result. People use many grammatical tenses and transformations, which reduces the apparent frequency of a single word and results in an ineffective match. At first I tried to compute the distance between the input and every sentence in the source text, but the program couldn't get through it because it took too much time. As an alternative, I select the top 10 sentences with the closest meaning to the input, because I found the score gap below roughly the top 20 sentences is quite large, leaving little chance for sentences whose meaning is not that close.
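A minimal sketch of this lemmatized matching step (the helper names are illustrative, not the actual code):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def normalize(sentence):
    # lowercase, tokenize, drop stopwords and punctuation, lemmatize
    tokens = nltk.word_tokenize(sentence.lower())
    return [lemmatizer.lemmatize(t) for t in tokens
            if t.isalpha() and t not in stop_words]

def relevance(input_words, sentence):
    # count how often the input's lemmas occur in the source sentence
    words = normalize(sentence)
    return sum(words.count(w) for w in set(input_words))

def top_matches(user_input, source_sentences, n=10):
    input_words = normalize(user_input)
    return sorted(source_sentences,
                  key=lambda s: relevance(input_words, s),
                  reverse=True)[:n]
```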

After I get the first 10 sentences, I train a sentiment classifier on the texts of many prose pieces and dump the result into a pickle. The critical point here is to combine the two scores from the meaning detector and the sentiment classifier. I manually set the ratio to 5:2, with meaning occupying the main part, because I think a close meaning makes a better output than a close sentiment on an unrelated topic.
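The 5:2 blend can be expressed in a couple of lines (my own formulation of the ratio described above; the scores themselves come from the matcher and the pickled classifier):

```python
def combined_score(meaning_score, sentiment_score, w_meaning=5, w_sentiment=2):
    # meaning dominates, per the 5:2 weighting
    return w_meaning * meaning_score + w_sentiment * sentiment_score

# e.g. rank candidates given as (sentence, meaning, sentiment) tuples
candidates = [("sentence one", 0.8, 0.3), ("sentence two", 0.5, 0.9)]
best = max(candidates, key=lambda c: combined_score(c[1], c[2]))
```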

At last, I set up a Twitter developer account and got all the access tokens and keys. I also set up the UI in the terminal, asking for the name of the user the poster wants to @ and what status they want to share.
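A hedged sketch of the posting step using tweepy (the credentials are placeholders, and the original UI may differ):

```python
import tweepy

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth)

handle = input("Who do you want to @? ")
generated = "a generated literary tweet"  # output of the steps above
api.update_status("@{} {}".format(handle, generated))
```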

Post-mortem:

I think there are still many tasks to do to realize my initial idea, because for now the output is more like a quote from the source text than a mixture of the user input and the quote from the literature. Based on this foundation, my next step is to apply the quote I get from the program to the sentence structure of the user input, or to replace the simpler parts of the user input. Overall, this was a good experience of exploring natural language processing, and it sets up the foundation for my future development.

Final Project

Instructor: Roopa Vasudevan
Class: Generative Language
Project: Final Project
Description:
I created a song lyric generator that generates songs based on essays about home. I have a collection of songs about home, because I noticed that many songs across time periods and genres, from the spiritual "Swing Low, Sweet Chariot" to Blink-182's "All the Small Things", tend to mention the concept of home. The way these songs describe home has been very interesting to me: I wonder what home means in all these songs and where there's crossover between the meanings. I wanted to create a project that worked in this medium and explored the definition of home in song lyrics.

Process:

I collected a series of song lyrics that contain the word home. At first I had considered using these song lyrics in letters home or journal entries; however, all my journals from the past few years were three hours' travel away, and I wasn't sure how the process of converting handwriting to text would play out. I originally wanted to do something more closely aligned with the meanings of the songs, so I created a list of keywords associated with them, such as belonging/un-belonging, loneliness, love, people, and being lost or wandering.

After talking with Luis about my project, he suggested I use Thought Catalogue posts about home as my corpus. I really appreciated this input because I wanted to stay closely aligned with the songs about home. One thing I noticed was that the Thought Catalogue posts tended to define home more explicitly, so I wondered if that would affect the meanings of the songs and make them more literal when they mention home. In the end, I used the NRC sentiment analysis lexicon to analyze each sentence in each Thought Catalogue post and determine its sentiment. The words in the sentence would then be tagged with their part of speech using NLTK's part-of-speech tagging function. I made lists combining part of speech and sentiment, for example "angerNouns": all nouns in a sentence whose sentiment is determined to be "anger" are added to this list.

The sentences are analyzed for sentiment using a modified version of the sentiment analysis exercise we did in class. Using the NRC dictionary, each word in a sentence adds a point to the overall sentiments found in that sentence; one sentence may score "3 positive, 1 anticipation, 1 joy" based on its words. The sentiment with the most points is then chosen as the overall emotion for that sentence, and the words are part-of-speech tagged and added to their corresponding lists. This part of the program sorts and collects all the information from the corpus text and prepares it for the next part of the program.
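A sketch of the per-sentence scoring, assuming the published word-level NRC EmoLex file (the file name and helper names are mine):

```python
from collections import Counter
import nltk

# load the NRC lexicon: each line is "word<TAB>emotion<TAB>0/1"
nrc = {}
with open("NRC-Emotion-Lexicon-Wordlevel-v0.92.txt") as f:
    for line in f:
        if not line.strip():
            continue
        word, emotion, flag = line.strip().split("\t")
        if flag == "1":
            nrc.setdefault(word, set()).add(emotion)

def sentence_emotion(sentence):
    counts = Counter()
    for token in nltk.word_tokenize(sentence.lower()):
        for emotion in nrc.get(token, ()):
            counts[emotion] += 1
    # e.g. "3 positive, 1 anticipation, 1 joy" -> "positive"
    return counts.most_common(1)[0][0] if counts else "neutral"
```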

A similar version of the sentiment analysis is run on each line of the song, analyzing it first for sentiment. Once the sentiment of a line has been determined, each word in it is tagged by part of speech; nouns, verbs, and adjectives are then replaced with words from the corresponding lists created from the Thought Catalogue posts.
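The replacement step, sketched under my assumptions (word_lists maps keys like "angerNouns" to words harvested from the corpus; the key scheme is hypothetical):

```python
import random
import nltk

REPLACEABLE = {"NN": "Nouns", "VB": "Verbs", "JJ": "Adjectives"}

def rewrite_line(line, emotion, word_lists):
    out = []
    for word, tag in nltk.pos_tag(nltk.word_tokenize(line)):
        suffix = REPLACEABLE.get(tag[:2])
        pool = word_lists.get(emotion + suffix) if suffix else None
        # swap nouns/verbs/adjectives for corpus words of the same emotion
        out.append(random.choice(pool) if pool else word)
    return " ".join(out)
```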

Final Result:

It was recommended that I start by getting the basic functionality of the program running: the sentiment analysis and word replacement. This is what I focused on, and I used only one song as a grammar where lyrics could be replaced. I was surprised by the result, because as I mentioned earlier, I wasn't sure how the Thought Catalogue posts, with their clunkier way of defining home (for example, "This is What 'home' Actually Means Because it's Not Just Where You Sleep and Shit"), would integrate into the song lyrics. Using the original songs as templates kept the definitions in the generated songs vague and offered a different perspective on home.

As I was reaching this point, the point where I had initially thought I would be able to finish the project, I realized there was a lot more that I hadn't considered: for example, how to construct a chorus, how to ensure one main theme is threaded throughout each generated song, and what to do about rhyme. The project at once seemed to grow into something much more expansive than I had originally anticipated.

 

Reflection:

I was disappointed that I didn't finish the project as I had set out to. When working on this project and dealing with sentiments about home, I kept coming back to the idea of cross-stitching a favorite generated song lyric onto a throw pillow. That was my personal goal for the project: to get the program up and running enough that it would be possible to have some final output that especially resonated with me. It was frustrating that despite all the time and effort I put into the project, I wasn't able to complete it to the extent that I wanted. I would say the result I got to was a first prototype.

I wonder if part of the struggle with this project came from being too tied to the song lyrics. I've been wanting to do something with this idea for a while, even once writing a paper about the topic, and that may have gotten in the way of other, more tangentially related ideas. Maybe I could have made a travel log where the author was very homesick and used lines from there. I feel like I may have limited myself when conceptualizing the project.

As a final project, the last thing I created after a semester of learning, I'm satisfied with how it turned out. This was definitely the most code-intensive class I've ever taken, and I think I ended up learning a lot, both about programming and about language and writing. Before this class I wouldn't have even known where to begin; for this project I struggled a bit with scale and with conceptualizing the project, but in terms of my actual abilities I improved a lot, and overall I'm very proud of my personal progress.

 

GL FINAL

One of my favourite poets is Mario Benedetti, an Uruguayan who is well known in Latin America for his writings about exile.

The original purpose of the project was to translate a collection of his poems in a way that would be faithful to his writings and preserve the original meaning of each poem. In order to do this, I used POS tagging and WordNet to get synonyms.

From there, several things could be done: because I had a list of synonyms, I could have used the words to construct a new poem with any structure I wanted to give it.
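A minimal sketch of that step (my own reconstruction using NLTK's WordNet interface):

```python
import nltk
from nltk.corpus import wordnet as wn

POS_MAP = {"NN": wn.NOUN, "VB": wn.VERB, "JJ": wn.ADJ, "RB": wn.ADV}

def synonyms_for_line(line):
    result = {}
    for word, tag in nltk.pos_tag(nltk.word_tokenize(line)):
        wn_pos = POS_MAP.get(tag[:2])
        if wn_pos:
            lemmas = {l.name().replace("_", " ")
                      for s in wn.synsets(word, pos=wn_pos)
                      for l in s.lemmas()}
            lemmas.discard(word)
            result[word] = sorted(lemmas)
    return result
```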

Instead of using what I knew about the poem (i.e. its linguistic structure), my first ideas were all related to statistics and probability: I thought about generating a CFG, about using Markov chains, and eventually about neural networks.

 

One of the consequences of generative text manipulation is loss. The original is fragmented and ‘transposed’ – it is exiled to another form. Although I knew it was a bit far out, I wanted to explore the idea. I had been playing with various forms of RNNs all week and I found the results really interesting.

Recurrent neural networks, or RNNs, are the family of networks to which LSTM (long short-term memory) networks belong. Information is represented by vectors (and the networks work by doing matrix operations), which is why the code can look so complicated.

The RNN that I used for this project is what is referred to as a 'vanilla' RNN: at each step it accepts a fixed-size vector as input and produces a fixed-size vector as output. More sophisticated RNNs operate over sequences of vectors (i.e. not a fixed number of steps) and thus learn much faster.

The ‘Vanilla’ RNN:


Hidden states (h_t) are stacked with those from previous timesteps (h_{t-1}); this is the memory of the network. Each hidden state depends on the input at the current timestep (x_t) and on the hidden state at the previous timestep (t-1), usually through a function such as f = tanh:

h_t = f(W_xh x_t + W_hh h_{t-1})

and the output (i.e. the sentences the RNN is generating) is computed using only the associated hidden state (in that sense, the RNN is Markovian):

y_t = softmax(W_hy h_t)
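Putting the two equations together, a single character step looks roughly like this numpy sketch (the sizes and names are mine, not the actual training code):

```python
import numpy as np

hidden_size, vocab_size = 100, 65
W_xh = np.random.randn(hidden_size, vocab_size) * 0.01   # input -> hidden
W_hh = np.random.randn(hidden_size, hidden_size) * 0.01  # hidden -> hidden
W_hy = np.random.randn(vocab_size, hidden_size) * 0.01   # hidden -> output

def step(x, h_prev):
    # h_t = tanh(W_xh x_t + W_hh h_{t-1})
    h = np.tanh(W_xh @ x + W_hh @ h_prev)
    # y_t = softmax(W_hy h_t): a distribution over the next character
    logits = W_hy @ h
    p = np.exp(logits - logits.max())
    return h, p / p.sum()
```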

I used the original poems (right), the Chinese translations (centre), and the English translations (left).

It is possible to see that the network learnt some of the shorter words.

Some short, common words are recognisable in the English translation: ‘and’, ‘they’, ‘I look’, ‘once’, ‘like’, and so on. The same applies to the Spanish version.  Interestingly, the format is kept – verses tend to be short, and the grammar seems to be inherited as well (i.e. ‘I look’, ‘of the’, ‘in the fall’). Considering that this is a program that is learning one character at a time, I think this is very interesting.

Chinese is an entirely different case, because a) the translations require different grammatical considerations and b) since the RNN learns individual characters, if grammatically correct Chinese text is used, it will actually be absorbed faster than any alphabetic language.

While exploring possible ways to generate new poems, I learned a lot about NLP and its applications. I reviewed much of what we learned in class and greatly improved my coding skills. Although I think that neural networks are a great application of NLP, and although I think that, as an idea, exiling the poem to 'adapt' to a foreign environment and a foreign language is quite poetic, perhaps this is not the path I should have followed.

In my original proposal I mentioned Pound's translations as a source of inspiration, and as Luis (or was it Rune?) mentioned today in his feedback, I could have done more work with the actual language of the poem in order to convey the same idea.

 

Final Project

For my final project, my idea was to take Chinese text and turn it into something color-based. The only problem was that I decided to do it in Chinese, which made calculating things a lot harder. I did eventually decide on a color scheme, calculating the sentiment when referring to people, but unfortunately I ran into difficulties with Processing and p5 importing the text. I did end up with a lot of data, though. I tried working with grammars and Markov chains for a long time, turning the Chinese phrases into parts of a sentence, but I couldn't figure out how to substitute them back in well and only got it to work in the terminal. So I changed my idea again: use the data I had to make a "build your own story" where the user guides the mood of the story. I wanted to do more visuals, but Processing ran into memory problems and I wasn't sure what to do because of the school computers. I also organized the text by who said it and was going to float it up the screen, maybe with pinyin, in p5 or Processing, but pinyin conversion takes a long time, even with pickle. I did get some text outcomes I liked, but I think I got caught up in the idea of making things animated. I did get things to scroll across the screen, but when I reopened the code it didn't work anymore, and I don't know what happened.

If I had thought it through more, I would have stuck to English. I did use HTML scraping, which I hadn't used before. I think I was pretty close with the grammar approach: I had sorted the words by part of speech, but there were so many parts of speech that I couldn't find a guide, and when I went to switch them out it was by line. So I have a lot of data sets with Chinese, pinyin, sentiment, and keywords, but I didn't get much more than text and graphical representations.

Edit:

My main goal for this project was to turn a piece of writing into a visualization. I originally wanted to use songs because I thought it would be easier on the grammar side, but I wanted a certain amount of randomness, so I set about looking for a lyrics API. While doing that, I realized I liked Chinese songs better than English ones because I think they tend to be more poetic. I spent a day trying HTML scraping on various Chinese lyric sites to no avail. Then I remembered a site I used to read called ONE, so I decided to try scraping that. It took a while, because for a previous project I had used an API instead of scraping, but I figured it out and then went to find a way to parse the text. I researched a lot of different modules before settling on SnowNLP, because it had Hanzi-to-pinyin conversion, which I considered using for a while, as well as parsing, sentiment, and keyword extraction. I didn't end up using most of these features, but they were fun to play with.
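For reference, the SnowNLP features mentioned above look like this (the sample sentence is mine):

```python
from snownlp import SnowNLP

s = SnowNLP(u"这是一个关于家的故事")
print(s.words)        # tokenization
print(s.pinyin)       # Hanzi to pinyin
print(s.sentiments)   # sentiment score in [0, 1], closer to 1 = more positive
print(s.keywords(3))  # top keywords
```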

For a while I was overwhelmed by data and just played around with it in graphs, but I wanted to find some way to create something with it and better control it. At first I wanted to let the user pick the direction of the story, but I realized that would be very slow going, so I decided to find songs for the text to match and map the notes to the sentiment. I thought about using a Python sound analyzer, but that seemed like a bit much, so I stuck with finding songs and writing out the notes myself. This time I decided to run it in p5, because I am more familiar with it. I first organized the songs by sorting the sentiment values, mapped out to 800 for the different notes; then I picked a song based on the text's content and went through the song to pick lines. The program would then output a story and three text files for p5 to use: one with the sentiment and the previous speaker, to show who was talking; one with the overall tone among the "me", "you", and "them" in the story; and the last with the song, so it could be played simultaneously. While it is not perfect, I like the output. I changed the colors of the bubbles based on the individual sentiment, the inverted sentiment, and the speaker.
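A small sketch of how the sentiment-to-note mapping might work, under my reading of the "mapped out to 800" detail (the note table is hypothetical):

```python
NOTES = ["C", "D", "E", "F", "G", "A", "B"]

def sentiment_to_note(score, span=800):
    # SnowNLP sentiment scores fall in [0, 1]; spread them over an 800-unit range
    pos = min(score, 0.999) * span
    return NOTES[int(pos / span * len(NOTES))]
```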

Generative Language Final Project Documentation

Idea: Take a podcast (with transcripts), analyze the language in it, and display visuals to match up with the audio.

Why: People keep telling me to listen to all sorts of different podcasts, but I usually just drift off and stop paying attention when I try to listen to one.

Slightly More Details: I specifically used The Adventure Zone (TAZ), a Dungeons and Dragons podcast, because it has a large and active fan base (even more details in the Process section). Also tabletop games in general tend to involve a lot of describing things with words and leaving the visuals up to the players’ imaginations.

Process: I actually managed to use quite a few different techniques that we learned in class! I shall talk about them in the order in which they appear in the code.

  • Regular Expressions: The transcript contained a lot of useless things, mostly interjections of "umm" and "uhh" as well as square brackets demarcating actions such as [laughter]. I used regular expressions to remove all of these before doing anything else with the text. I also used them to separate when someone was talking normally vs. speaking while roleplaying their character.
  • NLTK (POS stuff): analyzePrepositions() is one of the most important functions in the entire program. I iterate through every line in the transcript that isn't explicit roleplaying and POS-tag it. If there are prepositions in the sentence, then I try to figure out the noun before it and the noun that follows it (e.g., "You manage to hoist this wolf into the air" extracts 'wolf into air'). Each of these nouns then has a staging object created from it, with self.name (the first noun), another object that it is located relative to (the word following the preposition), and where it sits relative to that object (the preposition, e.g. 'above'). Each of these objects is then stored in a dictionary. (A sketch of this step follows the list.)
  • API (Tumblr): I use the Tumblr API to find images to represent the staged objects. Originally I tried using Twitter, but most posts there were text-based. findMedia() takes in the noun and its part of speech and searches Tumblr for posts tagged with that noun. I get the photo's URL from the API and then use urllib to download it.
  • JSON: I dump all the info from my staged objects into a JSON file and load that into Processing. Wow, learning how to use JSON in Processing was intense. There are a lot of JSON-specific methods, and a list of JSON objects is its own special JSONArray type. Most of my time in Processing was spent figuring out how to use JSON instead of working on visual stuff 🙁
  • Processing: Processing pretty much just reads data from the master JSON. It draws the images, using the file locations referenced in the JSON, at the coordinates stored in the JSON. The majority of the code here just makes sure the index references are correct.
  • In the example output, you can see that it processes 'crate near center': the dog is in a crate, so it sort of makes sense, but it is also a bit silly.
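A hedged reconstruction of the cleanup and analyzePrepositions() steps from the first two bullets (the regexes and helper names are mine, not the original code):

```python
import re
import nltk

def clean_line(line):
    line = re.sub(r"\[.*?\]", "", line)                      # drop [laughter] etc.
    line = re.sub(r"\b(u+m+|u+h+)\b", "", line, flags=re.I)  # drop umm/uhh
    return re.sub(r"\s+", " ", line).strip()

def analyze_prepositions(line):
    """Yield (noun, preposition, noun) triples, e.g. ('wolf', 'into', 'air')."""
    tagged = nltk.pos_tag(nltk.word_tokenize(clean_line(line)))
    for i, (word, tag) in enumerate(tagged):
        if tag == "IN":  # a preposition
            before = next((w for w, t in reversed(tagged[:i])
                           if t.startswith("NN")), None)
            after = next((w for w, t in tagged[i + 1:]
                          if t.startswith("NN")), None)
            if before and after:
                yield (before, word, after)
```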

Various Setbacks: The visualization itself was the hardest part. I had planned to use a p5 library for Python, but the library is still in development and didn't support critical functions like image() and text(). Figuring out when to draw what on the screen was very difficult for me to wrap my head around; right now I'm going with a rough amalgamation of math based on the number of lines in the transcript, the number of seconds in the podcast, and the frameRate/frameCount in Processing. Verbs are much harder to derive meaning from programmatically than nouns. JSON is a very elegant way of storing data, but a little difficult to get started with as a complete beginner.

Current Bugs: Some objects seem to be missing: there should be 6 objects per timestamp in the JSON, but not many timestamps have the full 6, and some are totally missing. findMedia() stopped working for proper nouns. I need to sync audio and text/images, though this is likely related to the missing objects. Images overlap a lot; it might help to do the relative positioning in Processing and take image size into account for proportions.

Number One Feature Improvement: I originally intended to use Tumblr only for TAZ-specific nouns (basically all of the proper nouns) and Google Images for more commonplace nouns. Tumblr can have some… weird images for certain search terms, and at other times the relationship between the picture and the tags is dubious at best. Splitting up findMedia() like this would lead to more consistent and less nonsensical results. Unfortunately, I didn't really have the time to learn another new API.

What I Learned:  The tumblr API, a deeper understanding of dictionaries probably, storing object data from Python into JSONs, downloading images from websites with code, importing JSONs into Processing, getting data from JSONs in Processing. JSON. JSON. JSON.

Conclusions: It's actually quite hard to say if my project met my expectations or not. On the one hand, I ran into a lot of issues, and the final result is not nearly as comprehensive as what I was imagining in my head. On the other hand, pretty much everyone I told my idea to said it was either "really difficult" or "quite ambitious" (both essentially meaning I was in over my head, with varying levels of encouragement), and I'm honestly kind of amazed at how far I got. This is the most code I've ever written for a single project, and even though it's not extremely well polished, the bugs are very few. Also see the What I Learned section above. I feel a lot more confident in programming after this project (and I also comment my code a bit more).

Interesting Stuff To Consider: My intention with this project was to create a tool to try and visualize podcasts. However, while I was programming analyzePrepositions() and trying to figure out how to determine where objects were in relationship to one another, I started wondering about how my project could work as a tool for directors and actors to use while staging a play. I’m curious to see how an implementation of my project that goes through a script and visualizes according to the stage directions would work.

Final Project: 摩拜可能嗨&OFO表白嗯

Summary ————————————————————-

As a student double majoring in BF and IMA, I wanted my GL final project to somehow combine the two majors. Therefore, I made a project comparing the use of words on Sina Weibo by the two largest bike-sharing companies, OFO and Mobike, to see whether there are any differences and, if so, what they are.

My final project does the following things:
1. get posts from Sina Weibo as source texts
2. tokenize these posts into meaningful words or phrases and get their part of speech (POS)
3. count the frequency of each word or phrase
4. visualize all the data

Here is the final result.

The part with the orange background stands for Mobike, while the yellow one stands for OFO. The width of each background depends on the total number of words used in the posts, and the size of each word or phrase is decided by its frequency: the more often a word or phrase is used, the bigger it is.

Apparently, OFO has more direct posts than Mobike (*direct post: a post made by the account owner, not a repost of another user). Even though Mobike posts less, it does more promotion than OFO. Keywords such as "红包" (red packet), "月卡" (monthly pass), and "骑行日" (cycling day) appear more frequently in Mobike's posts than in OFO's.

To make it look more interesting, I made another version.

Process ————————————————————-

1. Sina API ——————–

In order to get posts from OFO and Mobike, I created an app using the Sina Weibo API.

Unlike Twitter, Sina Weibo has a strict verification process. If you want to use the Weibo API, you need to complete an application form, including the type of your app (whether it is a mobile app, a website, a game, or something else), a brief introduction of your app, the app's website, and the app's icon and thumbnails.

I edited my app application more than ten times in one week. Unfortunately, all of my applications were rejected for different reasons, such as "There are no Sina Weibo plug-ins on your app website" and "The thumbnails uploaded are not consistent with the app website."

Since my app application couldn't pass the verification, I was not able to get an access token, meaning that even though I could get my own timeline, I couldn't get timelines of other users.

2. Crawler Program ——————–

I spent about a week on the API, but there was no progress. Therefore, I started to think about other ways to get posts from Sina Weibo. Luckily, I found a tutorial; following its instructions, I made a crawler program and got my source texts.

However, when comparing the source texts with the original posts, I found a problem: the texts produced by the crawler were incomplete. The first picture is Mobike's home page. If you compare the first post on the home page with the first post in the source texts, you will find them different. Just in case, I searched for the first post from the home page in the source text file; there was no match.

After a discussion with Jack Du, we found that the crawler can only get the real posts made by the account owner; if a post is a repost, the crawler cannot get it. We reviewed the crawler's source code and found that it looks for a class named "ctt", while there are no elements with class "ctt" in the page's HTML.

3. Deal with source texts ——————–

I spent several hours trying to fix the problem with the crawler, but failed. Moreover, I was running out of time, so I moved on to dealing with the source texts.

At the beginning, I had no idea what to do. I started by defining keywords myself and classifying the posts in the source texts. However, after writing several lines of code, I found this approach was not "smart" enough; after all, it was me classifying the posts manually.

Therefore, I started to search for libraries that can tokenize Chinese and analyze Chinese POS. After some research, I chose a library called '结巴' (jieba). It is a very amusing name, because '结巴' means stuttering. It works similarly to NLTK, but with different syntax.

I tokenized the posts into meaningful words or phrases, tagged them with POS, and counted the frequency of each word or phrase.
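A minimal jieba sketch of the tokenize/POS/count steps (the sample post is invented):

```python
from collections import Counter
import jieba.posseg as pseg

post = u"摩拜单车月卡优惠，骑行更方便"
counts = Counter()
for token in pseg.cut(post):
    if token.flag != "x":  # skip punctuation
        counts[(token.word, token.flag)] += 1
print(counts.most_common(10))
```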

4. Visualization ——————–

I finished collecting the data (all the steps in section 3) on Monday night and started to visualize it on Tuesday. I tried different libraries.

First, I used "text to image python" as search keywords, and I found a library named "text-to-image". However, the pictures it generated were not what I expected. Following are some of the pictures it generated.

As you can see, they are barely visible: just a few black-and-white pixels.

Later, I found a library called Pillow. It can create a new image and draw text onto it. However, I would need to manually calculate the position of each word or phrase, which would be very difficult and inefficient. Considering that Processing is better at visualization, I tried to do the visualization there; unfortunately, it couldn't recognize Chinese.

So I went back to Pillow. I used its built-in module to calculate the text size, created a formula to get the position (the top-left corner) of each word or phrase, and drew the word or phrase on the canvas. I tried different formulas and got several unsatisfying results along the way.
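A hedged Pillow sketch of the drawing step (the font path and layout formula are placeholders; my formula here is far simpler than the one described above):

```python
from PIL import Image, ImageDraw, ImageFont

words = [(u"红包", 30), (u"月卡", 22), (u"骑行", 15)]  # (word, frequency)
img = Image.new("RGB", (800, 400), "orange")
draw = ImageDraw.Draw(img)

x = y = 10
for word, freq in words:
    # a CJK-capable font is required for Chinese text
    font = ImageFont.truetype("NotoSansCJK-Regular.ttc", 12 + 2 * freq)
    w, h = draw.textsize(word, font=font)  # built-in text measurement (older Pillow API)
    if x + w > img.width:                  # naive line wrap
        x, y = 10, y + h + 10
    draw.text((x, y), word, font=font, fill="white")
    x += w + 10
img.save("mobike.png")
```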

Following are the final results. They are not 100% satisfying, but they are the best I could get. I then used Photoshop to put these pictures together.

     

5. Name the Project ——————-

The name of the project came from a CFG; I played around with the POS of Chinese.

The name is randomly chosen by the program. After running the program several times, I picked the best two.
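A toy sketch of what CFG-driven name generation can look like (this grammar is my invention, seeded with the tokens of the actual title):

```python
import random

grammar = {
    "NAME": [["BRAND", "ADV", "ADJ"], ["BRAND", "VERB", "PART"]],
    "BRAND": [[u"摩拜"], [u"OFO"]],
    "ADV": [[u"可能"]], "ADJ": [[u"嗨"]],
    "VERB": [[u"表白"]], "PART": [[u"嗯"]],
}

def expand(symbol):
    # terminals pass through; nonterminals expand a random production
    if symbol not in grammar:
        return symbol
    return "".join(expand(s) for s in random.choice(grammar[symbol]))

print(expand("NAME"))  # e.g. 摩拜可能嗨
```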

Sum Up ————————————————————-

It was a very meaningful process. It was the first time I did so much research by myself.

If I had more time, I would fix the problem with the crawler so that the reposted texts could be included in the source texts. Moreover, I would like to make the visualization more beautiful by exploring Pillow further or finding alternatives. The refined visualization should look like the following picture, or take some other format.

A more advanced version of this project would do real-time analysis of the posts on Sina Weibo.

CrazyBot: Final Project Documentation

My final project is a chatbot that operates on the emotional content of the user's input and outputs a stream of thought that hopefully matches the input in sentiment.

The original idea was a chatbot that could summarize the opinions of Twitter users about a certain topic. However, it wasn't meant to be: even streaming 60 megabytes' worth of tweets (around 15,000 messages) did not provide a sufficiently topic-segmented corpus, and while some common topics such as 'love' are relatively popular, with a decent number of tweets, others have fewer than 10 sentences at best. Another idea was to collect enough "original tweet / response" pairs and then train a recurrent neural network on this labeled data to create a 'general purpose' chatbot; however, it took too much time to compile a decent number of pairs, which left no time for training the network.

Therefore a different approach was implemented, taking advantage of the limited number of emotions in the NRC emotion lexicon. The tweets were mapped to their emotions, and because this loses the ability to steer the topic of the conversation by carefully picking the corpus, the bot's output came to look more like a blabber compiled from the different things users on Twitter say. There is a solution, of course: inflate the corpus merely by copying the topic dictionary several times. But that would only add a lot of very similar, unnecessary word structures, so the corpus that is sent to the generative model is compiled from all the topic-mapped sentences plus a chunk of the shuffled emotion-appropriate tweets. The size of the corpus is limited to 800, because even though a large corpus increases the probability of producing something meaningful in a sentence, it also increases the computation time dramatically.

HINT: giving more concrete topics to the generator improves the results.

***

The topics are extracted using the StanfordDependencyParser; one needs to change the path to the package manually on line 41, for example: path = ‘C:\\Program Files (x86)\\Stanford Parser\\stanford-corenlp-full-2017-06-09\\’

***

After the corpus is compiled, it is passed through the Markov-chain-inspired class TextTools, which tokenizes the tweets into sentences, splits the sentences on stopwords as well as punctuation (but leaves the punctuation in the corpus), then POS-tags those chunks (phrases) that are not strings of stopwords, and applies a Markov chain to the resulting corpus, which consists of sequences of POS-tagged phrases and stopwords. It is important to note that the length of a POS-tagged sequence is manually limited to at most 4 words, so it could be said that an n-gram of 4 is used here; that would not be precise, though, because uninterrupted POS sequences of 4 or more words are rare and usually indicate a misspelling that is not tracked by the tools I am using. The n-gram can also be controlled on line 729.
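A hedged sketch of the splitting idea (my own reconstruction, not the actual TextTools class):

```python
import nltk
from nltk.corpus import stopwords

STOPS = set(stopwords.words("english"))

def split_into_chunks(sentence):
    """Split a sentence into POS-tagged content phrases and raw stopwords/punctuation."""
    chunks, current = [], []
    for token in nltk.word_tokenize(sentence):
        if token.lower() in STOPS or not token.isalnum():
            if current:
                chunks.append(nltk.pos_tag(current))  # tag the content phrase
                current = []
            chunks.append(token)  # keep the stopword or punctuation in place
        else:
            current.append(token)
    if current:
        chunks.append(nltk.pos_tag(current))
    return chunks
```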

Afterwards, the sequence of steps is pretty much the same as in my midterm project. In short, the 'mask' of a sentence is constructed using common occurrences in the corpus and then encoded into natural language. Out of the multiple generated sentences, only a limited number are chosen by randomized sentiment analysis: with a probability of 60%, the most negative sentence is added to the final chain. The n-gram and mask-size variables can be controlled on line 715.

An unexpected difficulty was representing emoji characters in the Tkinter GUI. Tkinter cannot display characters outside the Basic Multilingual Plane (which is where most emoji live), so those characters need to be converted somehow. This is done by extracting the character's Unicode name and plugging it into the string, converting it to ASCII.
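The conversion can be done with the standard unicodedata module, roughly like this (the sample string is mine; compare the bracketed names in the sample output below):

```python
import unicodedata

def asciify(text):
    out = []
    for ch in text:
        if ord(ch) > 0xFFFF:  # outside the BMP, which Tkinter cannot display
            out.append("[%s]" % unicodedata.name(ch, "?"))
        else:
            out.append(ch)
    return "".join(out)

print(asciify(u"love you \U0001F49B"))  # -> love you [YELLOW HEART]
```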

Overall, the bot produces relatively comprehensible and sometimes peculiar sentences, but as I mentioned earlier, the weak relation to the topic is explained by the small corpus, which is also what rules out a recurrent-neural-network-based model. For the same reason, the output reads more like a stream of thought than like an actual thing a human would say. There is definitely room for improvement in the generative section; for example, a large enough neural network would be far more effective.

SAMPLE OUTPUT:

CrazyBo0_0t:
Hello, I am sliiiightly crazy…hah…Thats because I looked at too many tweets and now my brain is fried by you, humans!!! My favorite topics are….I dont remember, but I actually don’t give a damn what humans think anymore LOL LOL

 

Human: People are writing pleas to the state of Alabama on the day of its Senate race.

CrazyBo0_0t: Learned for die. You more hope you, have a pretty no democratic party activist are so honor Has is night with the hug him and know you more love until 274 used Boy was scoring my hate you have the kgb wanted by his day my given day with a kid”. Have a horrible manager at of you, hope someday to debut you his others.

 

Human: Thinking big takes the same amount of effort as thinking small

CrazyBo0_0t: Absolutely love waiting in my love is #bills game. You’re and love to office tells to love y’all in the life. How lovable and i i it’s not omg you run we, use is we’d love he better be past. Fucking love i socialist dhs over 17th birthday is in um excuse in 1. He was beautiful voice but counter attacking united will he’s me, and then hope harry so top websites at all. Awesome on the trust your martial. Embarrassingly negative.

 

Human: so what was that lousy thing, that trump said about jerusalem and palestine?

CrazyBo0_0t: In this january. City, your kim tensions. S you have a in a love you new purchases make what sometimes a egomaniacal narcissist to have #manchesterderby. I still like it is love to what always good of this love we can sad it days you and jordanian see it from in our ultimate compliment. Love and iss. To be once You come me too brainwash and didn’t heard so worried about. Happy birthday princess. From the had a meal you so wrong of the Guy it, but fergie but egomaniacal narcissist i will not love someone up the look you christmas now cause this [HEAVY BLACK HEART][?]. Then break been much your best players on a good people[HEAVY BLACK HEART][?]. Hope u like it actively targets since no other season you have a gotten lucky there”.

 

Human: songs are usually written about north korea and other puzzling things

CrazyBo0_0t: Make Will war no things. ! You know With baby to love of the because decent tools their smiles. Love <333333 of love [YELLOW HEART]. [TWO HEARTS]. Snowy sunday afternoo…… He members how ure one in hurt participants and wasnt and special place who things their Dire an goal difference and you them drainining I love to love you members you things). To kenya, we him.

Final Project Documentation

For this final project, I decided to make a Twitter bot that would sound exactly like a human user. My intention was to collect live tweets using hashtags and then generate new tweets using Markov chains, regex, and a context-free grammar. However, after including the context-free grammar, the tweets were not really replicating the way people talk on the internet (e.g., slang, incomplete sentences/phrases, abbreviations), so I decided to proceed without it. I also initially extracted links to external pages from the tweets and randomly assigned one to some of the posts. But this generated misleading posts, and as my intention was not clickbaiting, I decided not to separate links from text in tweets.

My program is basically structured into 4 distinct sections: collecting tweets, applying Markov chains to the collected tweets, generating new tweets, and posting the computer-generated tweets to a Twitter account. Here's what some of the collected (original) tweets looked like:

First, 100 tweets are collected that contain a hashtag randomly chosen from a list of hashtags the user is interested in, and then 1 new tweet is generated from them using Markov chains and a 2-gram. To give the bot a 'personality', all these tweets are also collected into a file, so that the bot can consider past tweets on a certain topic when generating new tweets. This was done using pickle, as shown below:
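Since the referenced screenshot isn't reproduced here, a sketch of the pickle persistence (the file name and helper are hypothetical):

```python
import os
import pickle

MEMORY = "collected_tweets.p"

def remember(new_tweets):
    # append the fresh tweets to everything the bot has seen on this topic
    past = []
    if os.path.exists(MEMORY):
        with open(MEMORY, "rb") as f:
            past = pickle.load(f)
    past.extend(new_tweets)
    with open(MEMORY, "wb") as f:
        pickle.dump(past, f)
    return past  # the full history feeds the Markov model
```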

The program does not stop after posting one tweet, but keeps running for as long as the user wants it to. It also checks whether the new tweet is within Twitter's character limit; if not, it chooses another generated tweet that satisfies this condition. During this process, the retweet characters ('RT') are also removed.
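A sketch of the length check and 'RT' stripping (a 140-character limit is assumed, per the Twitter of that time; candidate generation is left out):

```python
def pick_postable(candidates, limit=140):
    for tweet in candidates:
        tweet = tweet.replace("RT", "").strip()  # drop retweet markers
        if len(tweet) <= limit:
            return tweet
    return None  # nothing fit; generate more candidates
```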

I am extremely satisfied with the results as most of the tweets sound like they were composed by a human user, and not a bot. Generally, the tweets are grammatically correct, however even when they are not, they just come off as typing mistakes. My idea of tweeting about a topic that is being actively discussed, and reflect/support the current sentiment did come to fruition. These are some of the tweets that were generated using #munvsars (Manchester United vs Arsenal), #Jerusalem and #bitcoin hashtags respectively: