As a student double-majoring in BF and IMA, I wanted my GL final project to combine these two majors somehow. Therefore, I made a project comparing the word usage on Sina Weibo of the two largest bike-sharing companies, OFO and Mobike, to see whether there are any differences and, if so, what they are.
My final project does the following things:
1. get posts from Sina Weibo as source texts
2. tokenize these posts into meaningful words or phrases and tag each with its part of speech (POS)
3. count the frequency of each word or phrase
4. visualize all the data
Here is the final result.
The part with the orange background stands for Mobike, while the yellow one stands for OFO. The width of each background depends on the total number of words used in the posts, and the size of each word or phrase is decided by its frequency. The more often a word or phrase is used, the bigger it is.
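The two mappings described above (word count to panel width, frequency to text size) can be sketched as follows. This is only a minimal illustration: the linear scaling, the size range, and the canvas width are assumptions, not the exact formulas used for the pictures.

```python
def font_size(freq, max_freq, min_size=12, max_size=72):
    """Map a word frequency to a font size by linear interpolation.

    min_size and max_size are hypothetical values chosen for illustration.
    """
    if max_freq <= 1:
        return min_size
    ratio = (freq - 1) / (max_freq - 1)
    return round(min_size + ratio * (max_size - min_size))


def panel_widths(total_a, total_b, canvas_width=1200):
    """Split the canvas between two accounts in proportion to their word totals."""
    width_a = round(canvas_width * total_a / (total_a + total_b))
    return width_a, canvas_width - width_a
```

With this scaling, a word used once gets the smallest size and the most frequent word gets the largest one.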
Apparently, OFO has more direct posts than Mobike (*Direct post: a post written by the account owner, not a repost of another user's post). Even though Mobike posts less, it does more promotion than OFO. Keywords such as “红包” (red envelope), “月卡” (monthly pass), and “骑行日” (cycling day) appear more frequently in Mobike’s posts than in OFO’s.
To make it look more interesting, I made another version.
1. Sina API ——————–
In order to get posts from OFO and Mobike, I created an app using the Sina Weibo API.
Unlike Twitter, Sina Weibo has a strict verification process. If you want to use the Weibo API, you need to complete an application form that covers the type of your app (a mobile app, a website, a game, or another type), a brief introduction of the app, the app's website, and its icon and thumbnails.
I edited my app application more than ten times in one week. Unfortunately, all of my applications were rejected for different reasons, such as “There are no Sina Weibo plug-ins on your APP website.” and “The thumbnails uploaded are not consistent with the APP website.”
Since my app application couldn’t pass the verification, I was not able to get an access token. This meant that even though I could get my own timeline, I couldn’t get timelines from other users.
2. Crawler Program ——————–
I spent about one week on the API, but there was no progress. Therefore, I started to think about other ways to get posts from Sina Weibo. Luckily, I found a tutorial. Following its instructions, I wrote a crawler program and got my source texts.
However, when I compared the source texts with the original posts, I found a problem: the texts collected by the crawler were incomplete. The first picture is the home page of Mobike. If you compare the first post on the home page with the first post in the source texts, you will find that they are different. To be sure, I searched the source text file for the first post on the home page, and no match was found.
After a discussion with Jack Du, we found that the crawler can only get direct posts written by the account owner. If a post is a repost, the crawler cannot get it. We reviewed the source code of the crawler and found that it refers to “ctt”, while no class named “ctt” appears in the source HTML.
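To show the kind of check involved, here is a minimal sketch of class-based extraction using only Python's standard library. It is not the tutorial's crawler: the sample markup and the class name "repost-text" are made up for illustration, and only "ctt" comes from the code we reviewed. If a page does not use the class the crawler looks for, the result is simply empty.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collect the text of every <span> whose class list contains `target`."""

    def __init__(self, target):
        super().__init__()
        self.target = target
        self.depth = 0      # > 0 while we are inside a matching <span>
        self.posts = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth > 0:
            self.depth += 1             # nested <span> inside a match
        elif self.target in classes:
            self.depth = 1
            self.posts.append("")       # start collecting a new post

    def handle_endtag(self, tag):
        if tag == "span" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.posts[-1] += data


def extract_posts(html, target="ctt"):
    """Return the stripped text of all spans carrying the class `target`."""
    parser = ClassTextExtractor(target)
    parser.feed(html)
    parser.close()
    return [post.strip() for post in parser.posts]


# Hypothetical markup: one direct post and one repost with a different class.
sample = (
    '<div><span class="ctt">direct post</span></div>'
    '<div><span class="repost-text">a repost</span></div>'
)
print(extract_posts(sample))                 # only the "ctt" span is found
print(extract_posts(sample, "repost-text"))  # the other class must be asked for
```

Running the extractor with each candidate class name on a saved copy of the page is a quick way to see which posts a given selector misses.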
3. Deal with source texts ——————–
I spent several hours trying to fix the problem with the crawler, but I failed. Moreover, I was running out of time, so I moved on to dealing with the source texts.
At the beginning, I had no idea what to do. I started by defining keywords myself and classifying the posts in the source texts.
However, after writing several lines of code, I found that the code was not “smart” enough. After all, it was me who was classifying the posts manually.
Therefore, I started to search for libraries I could use to tokenize Chinese and analyze Chinese POS. I did some research and chose a library called ‘结巴’ (Jieba). It is a very interesting name because ‘结巴’ means stuttering. It works similarly to nltk, but with a different syntax.
I tokenized the posts into meaningful words or phrases, tagged them with POS, and counted the frequency of each word or phrase.
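The tokenize, tag, and count steps can be sketched roughly as below. This is a minimal example, not my exact script: it uses `jieba.posseg.cut`, which yields tokens with `.word` and `.flag` attributes, and it assumes that dropping tokens tagged `x` (which, as far as I know, Jieba uses for punctuation and other non-words) is a reasonable filter. The function names are hypothetical.

```python
from collections import Counter


def tag_post(text):
    """Return (word, pos) pairs for one post using Jieba's POS tagger."""
    # Imported here so the counting code below works without Jieba installed.
    import jieba.posseg as pseg  # third-party: pip install jieba

    return [(token.word, token.flag) for token in pseg.cut(text)]


def count_words(tagged, skip_pos=("x",)):
    """Count each (word, pos) pair, skipping whitespace and unwanted tags.

    skip_pos is an assumption: tags listed there are left out of the count.
    """
    counts = Counter()
    for word, pos in tagged:
        if word.strip() and pos not in skip_pos:
            counts[(word, pos)] += 1
    return counts
```

For a list of posts, `count_words` can be fed the concatenated output of `tag_post` for every post, and `counts.most_common(20)` then gives the most frequent words with their POS.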
4. Visualization ——————–
I finished collecting data (all the steps in section 3) on Monday night, and I started to visualize the data on Tuesday. I tried different libraries.
First, I used “text to image python” as search keywords and found a library named “text-to-image”. However, the pictures it generated were not what I expected. The following are some of the pictures it generated.
As you can see, they are barely visible. They are just a few black-and-white pixels.
Later, I found a library called “pillow”. It can create a new image and draw text onto it. However, I needed to calculate the position of each word or phrase manually, which would be very difficult and inefficient. Considering that Processing is better at visualization, I tried to do the visualization in Processing. Unfortunately, it couldn’t recognize Chinese.
So I had to go back to ‘pillow’. I used its built-in functionality to calculate the text size, created a formula to get the position (the top-left corner) of each word or phrase, and drew the word or phrase on the canvas. I tried different formulas and got several unsatisfying results.
The following are the final results. They are not 100% satisfying, but they are the best results I could get. Then I used Photoshop to put these pictures together.
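One simple way to compute the top-left corners is a row-by-row layout that wraps when a word would run past the right edge. The sketch below is an illustration of that idea, not the formula I ended up with. It measures text with Pillow's `ImageDraw.textbbox` (available in recent Pillow versions); `font_path`, the canvas size, the padding, and the colors are hypothetical values.

```python
def layout_words(sized_words, canvas_width, measure, padding=8):
    """Place words left to right, wrapping to a new row on overflow.

    sized_words: list of (word, font_size) pairs
    measure: callable (word, font_size) -> (width, height) in pixels
    Returns a list of (word, font_size, x, y); (x, y) is the top-left corner.
    """
    placed = []
    x, y, row_height = padding, padding, 0
    for word, size in sized_words:
        width, height = measure(word, size)
        # Wrap unless the word is already the first one in its row.
        if x + width + padding > canvas_width and x > padding:
            x = padding
            y += row_height + padding
            row_height = 0
        placed.append((word, size, x, y))
        x += width + padding
        row_height = max(row_height, height)
    return placed


def render(sized_words, font_path, canvas_size=(600, 800), background="orange"):
    """Draw the words on a new image; font_path must point to a CJK-capable font."""
    from PIL import Image, ImageDraw, ImageFont  # third-party: pip install pillow

    image = Image.new("RGB", canvas_size, background)
    draw = ImageDraw.Draw(image)
    fonts = {}

    def font_for(size):
        if size not in fonts:
            fonts[size] = ImageFont.truetype(font_path, size)
        return fonts[size]

    def measure(word, size):
        # Right and bottom edges of the text box when drawn at (0, 0).
        _, _, right, bottom = draw.textbbox((0, 0), word, font=font_for(size))
        return right, bottom

    for word, size, x, y in layout_words(sized_words, canvas_size[0], measure):
        draw.text((x, y), word, font=font_for(size), fill="black")
    return image
```

Keeping the layout separate from the drawing makes it easy to try different formulas without touching the Pillow code. Words that fall below the bottom edge are not handled here.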
5. Name the Project ——————-
The name of the project was generated with a CFG (context-free grammar). I played around with the POS of Chinese words.
The name is randomly chosen by the program. After running the program several times, I picked the best two.
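The idea of generating a name from a CFG can be sketched like this. The grammar rules and most of the words below are made up for illustration and are not the ones I actually used. Each rule is picked at random, so running the program again gives a different name.

```python
import random

# Hypothetical grammar: a name is built from two POS slots,
# and each POS slot expands to one word.
GRAMMAR = {
    "NAME": [["ADJ", "NOUN"], ["NOUN", "VERB"], ["VERB", "NOUN"]],
    "ADJ": [["绿色"], ["共享"]],
    "NOUN": [["单车"], ["红包"], ["月卡"]],
    "VERB": [["骑行"]],
}


def generate(symbol, grammar, rng):
    """Expand `symbol` by recursively choosing a random production for it."""
    if symbol not in grammar:
        return symbol  # terminal: an actual word
    production = rng.choice(grammar[symbol])
    return "".join(generate(part, grammar, rng) for part in production)


rng = random.Random()  # pass a seed, e.g. random.Random(42), to repeat a run
for _ in range(5):
    print(generate("NAME", GRAMMAR, rng))
```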
Sum Up ————————————————————-
It was a very meaningful process. It was the first time I did so much research by myself.
If I had more time, I would fix the problem with the crawler so that the repost texts could be included in the source texts. Moreover, I want to make the visualization more beautiful by exploring more of what ‘pillow’ can do or by finding alternatives. The refined visualization should look like the following picture or take another format.
A more advanced version of this project would do real-time analysis of the posts on Sina Weibo.