Content Analysis

The World Wide Web provides tremendous opportunities for researchers of the Chinese news media. Here are some useful links to electronic sources for collecting, managing, and analyzing Chinese media content.

Content Analysis Software

Yoshikoder is an open-source content analysis program compatible with both Macintosh and Windows. You can download the software version currently recommended by the developer here, along with instructions on how to use Yoshikoder on Windows in English or in Chinese. To use the software with texts in simplified Chinese characters, you need a tokenizer. On a Mac, install the tokenizer available on the Yoshikoder website; on a PC, you need to link a tokenizer with each individual text document, a process called segmentation. Even on Macs I recommend segmentation using the Yoshikoder segmenter (available from Will Lowe), as it improves recognition of keywords included in the project file. Simply add a UTF-8 text file with your Yoshikoder project file before segmenting one or several articles. If the software still does not recognize Chinese characters, you may need to run your system in Chinese. For tips on how to use Chinese characters on Macs, see Yale’s website for Chinese on Macs; for PCs, see mandarintools.com.
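To illustrate what segmentation does, here is a minimal Python sketch of forward maximum matching, a simple dictionary-based way to split unspaced Chinese text into tokens. It is not the algorithm used by the Yoshikoder segmenter, which is not described here; the function name `segment` and the sample keywords are assumptions made for illustration only.

```python
def segment(text, vocabulary):
    """Split text into tokens by forward maximum matching.

    At each position the longest vocabulary entry that matches is taken;
    if nothing matches, a single character becomes its own token.
    """
    # Longest entry length bounds the search window (at least 1).
    max_len = max(max((len(word) for word in vocabulary), default=1), 1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical keyword list standing in for the keywords of a project file.
keywords = {"中美关系", "关系", "重要"}
print(" ".join(segment("中美关系很重要", keywords)))
```

Joining the tokens with spaces gives text in which each keyword stands as a separate word, which is the form a keyword-matching program can work with.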

Note that Yoshikoder can only process text (.txt) files in UTF-8 encoding. To convert PDF files into UTF-8 text, use the Yoshikoder converter; to convert text files into UTF-8 encoding, you can use Chinese Encoding Converter. To prepare texts for segmentation, remove spaces between characters using the replace function in Word or TextEdit. The same function can also remove unwanted paragraph breaks (in Word, type “^p” in the “find” field and leave the “replace with” field empty).
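For readers who prefer scripting these preparation steps, the following is a minimal Python sketch, not part of Yoshikoder or of the converters mentioned above. It assumes the source file is encoded in GB18030 (a common encoding for simplified Chinese; adjust `source_encoding` if your files differ), and the function names `clean_text` and `convert_to_utf8` are hypothetical.

```python
import re
from pathlib import Path

# A CJK ideograph from the basic Unicode block (U+4E00 to U+9FFF).
_CJK = r"[\u4e00-\u9fff]"
# Spaces, tabs, or full-width spaces sitting between two Chinese characters.
_SPACE_BETWEEN_CJK = re.compile(rf"(?<={_CJK})[ \t\u3000]+(?={_CJK})")


def clean_text(text):
    """Remove spaces between Chinese characters and delete all line breaks."""
    # Normalize Windows and old Mac line endings first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = _SPACE_BETWEEN_CJK.sub("", text)
    # Delete every line break, mirroring the "^p" replacement in Word.
    return text.replace("\n", "")


def convert_to_utf8(src, dst, source_encoding="gb18030"):
    """Read src in the assumed source encoding, clean it, write dst as UTF-8."""
    text = Path(src).read_bytes().decode(source_encoding)
    Path(dst).write_text(clean_text(text), encoding="utf-8")


print(clean_text("中 国 新 闻\n报 道"))
```

Because all line breaks are deleted, this sketch is best applied to one article at a time; keep a copy of the original files in case the assumed encoding turns out to be wrong.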

A dictionary of positive and negative Chinese terms can be retrieved from Ku Lun-wei at Academia Sinica. Jonathan Hassid has also used a dictionary of sensitive keywords frequently censored on the internet. For my own work I created USCATA dictionaries and LLCATA dictionaries, which display Chinese characters embedded in Yoshikoder coding when downloaded to your computer. For use in Yoshikoder, simply replace the file extension “.txt” with “.ykp”. More information regarding USCATA and LLCATA is available under data on this website. If you have created your own dictionary and are willing to share it, please let me know by e-mail.
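As a rough illustration of how such a dictionary is applied to segmented text, the Python sketch below counts how many tokens fall into each category. It is not Yoshikoder's implementation, and the category names and sample terms are made-up placeholders, not entries from any of the dictionaries mentioned above.

```python
from collections import Counter


def count_categories(tokens, dictionary):
    """Count how many tokens belong to each dictionary category."""
    # Start every category at zero so that empty categories are still reported.
    counts = Counter({category: 0 for category in dictionary})
    for token in tokens:
        for category, terms in dictionary.items():
            if token in terms:
                counts[category] += 1
    return counts


# Placeholder dictionary and an already segmented sentence (both hypothetical).
sample_dictionary = {"positive": {"合作", "友好"}, "negative": {"冲突"}}
sample_tokens = ["中美", "合作", "与", "冲突", "合作"]
print(dict(count_categories(sample_tokens, sample_dictionary)))
```

The counts depend on the text having been segmented first: a term is only matched when it appears as a token of its own.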

If you find the above information useful, please cite my chapter entitled “Information Overload? Collecting, Managing, and Analyzing Chinese Media Content” in Allen Carlson, Mary Gallagher, Kenneth Lieberthal, and Melanie Manion, eds., 2010. Contemporary Chinese Politics: New Sources, Methods, and Field Strategies. New York: Cambridge University Press. The chapter also explains in more detail how to use electronic sources for content analysis of Chinese texts. More information is available in the online appendix and in Yu-wen Chen’s JCPS article.

Data Sets

Contact me if you want to use USCATA (United States Computer-aided Text Analysis) or LLCATA (Labor Law Computer-aided Text Analysis). Information about these data sets is available in the data section of this website. The USCATA data documentation includes a memo describing the problems we encountered and how we solved them. We provide tips for those who intend to use Yoshikoder to analyze texts in Mandarin Chinese.

Online Search Engines

Online search engines are listed in Table A.1 in the online appendix of “Information Overload?”.

Software for Qualitative Analysis of Chinese Texts

I haven’t used these programs myself, but they can apparently be used for qualitative analysis of Chinese texts:

NVivo (Windows)
MaxQda (Windows)
Atlas.ti (Windows)
TAMS (Mac)
Scrivener, a writing software that can also do some qualitative text-analysis (Mac)

Information on Chinese Language Processing

Check out the Asian Federation of Natural Language Processing for more information on the computational analysis of Chinese and other Asian languages.