
FACILITATING STATISTICAL ANALYSIS OF DIGITAL TEXTUAL DATA: A TWO-STEP APPROACH

Svetlana Stepchenkova, Department of Hospitality and Tourism Management
Andrei P. Kirilenko, Department of Forestry and Natural Resources
Alastair M. Morrison, College of Consumer and Family Sciences
Purdue University

ABSTRACT

Studies of holistic destination images involve analysis of qualitative data. The paper proposes a two-step methodology that facilitates statistical analysis of texts. First, the variables of interest are identified in textual files using the CATPAC software. Then, a frequency count of the occurrences of these variables in every file is conducted by the WORDER program. Special issues of content analysis, such as synonyms, multi-word concepts, plural versus singular forms, etc., are handled efficiently. The application of the methodology is illustrated with three examples of destination image studies. The place of the proposed approach among quantitative data analysis methods is also discussed.

Key Words: Destination image, content analysis, digital textual data, CATPAC, WORDER, organic image.

INTRODUCTION

Analysis of destination images provides important insights into consumer travel behavior; however, the composite nature of the destination image construct presents great challenges for its measurement. Traditionally, a strong preference has been given to structured methods of image measurement (Echtner and Ritchie, 1991; Pike, 2002). While structured methodologies have a number of advantages over qualitative methods, they focus on particular destination attributes and generally neglect the holistic aspect of destination image. The measurement of holistic images, whether stereotypical, affective, induced, or organic, involves content analysis of textual and/or pictorial materials.
A number of destination image studies have employed sorting and categorization techniques to identify the frequencies of certain words, concepts, objects, or people, and treated the most frequent ones as variables, or dimensions, of the destination image construct (Echtner and Ritchie, 1993; Dann, 1996; MacKay and Fesenmaier, 1997; Tapachai and Waryszak, 2000; Andsager and Drzewiecka, 2002; Echtner, 2002). While the unstructured methodologies of destination image measurement often contain a quantitative element, they do not facilitate statistical and comparative analyses of destination images (Jenkins, 1999), and they are time consuming and subject to the biases of the researcher. Although destination image studies have made a number of theoretical advancements since the 1970s, some concepts, e.g., the organic image, have not been sufficiently tested on, and practically applied to, the available data because of the lack of a well-defined procedure for such a study and the large amounts of text involved in the analysis.

A procedure that would facilitate this type of content analysis should meet at least three criteria. First, the procedure has to be computer-assisted, given the increasingly large amounts of textual data available in digital databases and on the Internet. Second, it should aid in the identification of image variables from a certain theoretical perspective. Third, it should be able to process a statistically representative sample of text blocks, so that the numerical responses on the selected image variables can be generalized to the whole population of texts. Additionally, it is highly desirable that the procedure assist in "cleaning up" the data, a tedious task that often precedes the actual content analysis. The proposed two-step approach efficiently addresses all these problems.
In the following sections, the paper describes the procedure and gives examples of its implementation in three destination image studies: Russia's stereotypical and affective holistic images, and a comparison of the organic images of China and Russia.

CONTENT ANALYSIS

There are two general classes of content analysis methods employed in the social sciences: qualitative and quantitative. The former term refers to non-statistical and exploratory methods, which involve inductive reasoning (Berg, 1995), and the latter refers to methods that are capable of providing statistical inferences from text populations. Content analysis examines textual data for patterns and structures within text, develops categories, and aggregates them into perceptible constructs (Insch and Moore, 1997; Gray and Densten, 1998). Content analysis is capable of capturing a richer sense of concepts within the data due to its qualitative basis and, at the same time, can be subjected to quantitative data analysis techniques (Insch and Moore, 1997). A central idea of quantitative content analysis is that "many words of text can be classified into much fewer content categories" (Weber 1985:7). The methodology of extracting content categories from the text and counting the occurrences of the themes in the sampled text blocks was developed by the mid-20th century and is often referred to as contingency analysis (Roberts, 2000). George (1959), one of the pioneers of content analysis, was critical of the use of contingency analysis, arguing that the method was not sensitive enough to reveal how themes were used in the sampled blocks of text and what meaning the speaker had intended for them. Indeed, contingency analysis assumes that "what an author says is what he means" (Pool 1959:4), and cannot take into account figures of speech, irony, or sarcasm. Moreover, the semantic structure of the language is also disregarded. Therefore, the theme count is occasionally inflated, and the data yields inaccurate results. Despite this critique, however, contingency analysis, whether computer-assisted or done manually, has long been employed in social studies due to its clear methodological reasoning, based on the assumption that the most frequent theme in a text is the most important one.

Roberts (2000) organizes quantitative content analyses along two dimensions, interpretational and structural, which form the basis of his 2x3 taxonomy. The interpretational dimension reflects the perspective from which the textual data is interpreted, i.e., that of the speaker or of the researcher. In the approach from the speaker's, or representational, perspective, texts are used to identify the speaker's intended meaning.
If the researcher's perspective is dominant, texts are interpreted in terms of the researcher's theory. Following the discussion by George (1959), Pool (1959), Osgood (1959), and Shapiro (1997), Roberts (2000) concludes that in many instances text analysis involves both perspectives: it is representational when words for thematic categories are coded based on their face value, yet the researcher might use the data on these themes to interpret it instrumentally. On the second, structural, dimension of Roberts' 2x3 taxonomy there are thematic, semantic, and network text analyses. The thematic approach is rooted in contingency analysis and involves counting themes belonging to a certain theoretical construct within text blocks. In semantic text analysis, textual data is broken into specified semantic units, e.g., subject-action-object triplets, and every semantic unit is associated with a certain numerical sequence, which reflects the a priori established codes of the themes found in each particular unit. Lastly, in network analysis, text blocks are represented as networks of interrelated themes, and theme linkages are measured by specially generated variables. A quantitative content analysis, whether representational or instrumental, always produces a two-dimensional data matrix suitable for further statistical analysis.

The large volumes of digital textual data available and the repetitiveness of the task have made the computer a natural and powerful choice for content analysis (Insch et al., 1997), despite the fact that not all nuances of the language can be recognized by any given software program. A review of selected content analysis software (16 programs) by Alexa and Zuell (2000) concludes that all the reviewed computer programs for textual data analysis have their strengths and weaknesses and might not support certain operations associated with content analysis in an efficient and user-friendly manner.
Alexa and Zuell (2000:318) argue that this lack of support would not be a problem "if it were possible to use two or more different software packages for a single text analysis project in a seamless and user-friendly way."

METHODOLOGY

The proposed approach facilitates content analysis of digital textual data using a combination of the long-established CATPAC software (Woelfel, 1998) and the recently developed WORDER program (Kirilenko, 2004). As stated in the CATPAC manual, "CATPAC is a self-organizing artificial neural network that has been optimized for reading text. CATPAC is able to identify the most important words in a text and determine patterns of similarity based on the way they are used in text" (Woelfel 1998:11). CATPAC can count the most frequently used words in a textual file and has been employed in content analysis of political speeches, focus group interviews, marketing, and tourism-related research (Schmidt, 1998; Doerfel and Marsh, 2003; Kim et al., 2005). However, CATPAC analyzes only one textual file at a time, and content analysis projects that use web-based information, survey responses, newspaper articles, etc. typically involve a large number of separate files that have to be processed. The WORDER software was developed specifically for this purpose. During one run, WORDER is capable of parsing up to 1,000 textual files, looking for up to 1,000 key words and counting their occurrences in every file. The resulting numerical matrix of key word frequencies can be conveniently transferred to any statistical package. The two programs complement each other and, used together, broaden the possibilities for researchers working with textual data. Ultimately, the approach allows: 1) identification of destination image variables in the digital textual data using CATPAC, and 2) counting the occurrences of these variables in every textual file with WORDER.

In the first stage of the proposed two-step approach, CATPAC is run on the pooled textual responses. The standard version of CATPAC can identify up to the 160 most frequent meaningful words in a file. Auxiliary words that do not add to the meaning, i.e., prepositions, articles, conjunctions, etc., are specified in CATPAC's Exclude file and ignored. In the CATPAC output, the image variables are ranked in order of their frequencies; however, some words in the output might not belong to image variables (e.g., "year", "often"). These words can be excluded from further analysis by adding them to the Exclude file before CATPAC is run again. Several iterations are usually enough to obtain the desired number of the most frequent image variables, the number of iterations depending on the variable frequency level at which the cut-off line is placed.

In the second stage, the image variables identified by CATPAC are used as the input for WORDER. They are specified for counting by means of an input table. The other input required by WORDER prior to counting is the list of all textual file names. When activated, WORDER counts every occurrence of every image variable in every textual file using the input table and the list of file names. The result is an output table in which the rows are the analyzed files and the columns are the image variables; the table cells contain the image variable frequencies counted in every textual file. The output table produced by WORDER can easily be transferred into SPSS for further statistical analysis and clustering purposes. While CATPAC has a clustering function, the program does not handle files of substantial size very well (pooled textual data is usually large): dendrograms, which are supposed to show how the most frequent unique words cluster into meaningful concepts, look "like a mitten instead of a glove" (Woelfel 1998:25).
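The two stages described above can be sketched in Python. This is only an illustrative approximation of what CATPAC and WORDER do, not their actual implementations; the function names, the stop-word list, and the sample responses are assumptions made for the example.

```python
# A minimal sketch of the two-step procedure: (1) rank meaningful words in
# the pooled text by frequency, (2) count those words in every file.
# All names and the stop-word list here are illustrative, not from CATPAC/WORDER.
from collections import Counter
import re

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "in", "to", "is", "are"}

def tokenize(text):
    """Lower-case the text and keep alphabetic tokens that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def identify_variables(pooled_text, top_n=45):
    """Stage 1 (CATPAC-like): the top_n most frequent meaningful words."""
    counts = Counter(tokenize(pooled_text))
    return [word for word, _ in counts.most_common(top_n)]

def count_per_file(file_texts, variables):
    """Stage 2 (WORDER-like): a files-by-variables frequency matrix."""
    matrix = []
    for text in file_texts:
        counts = Counter(tokenize(text))
        matrix.append([counts[v] for v in variables])
    return matrix

# Usage: pool the responses, pick the variables, then count them per response.
responses = ["Cold. Beautiful churches.", "Beautiful old churches and cold winters."]
variables = identify_variables(" ".join(responses), top_n=3)
matrix = count_per_file(responses, variables)
```

The resulting matrix has one row per file and one column per image variable, matching the structure of the WORDER output table that is later moved into SPSS.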
In the case of a large text body, the CATPAC manual suggests experimenting with the procedure parameters. The proposed two-step approach of combining CATPAC and WORDER allows clustering of the image variables into themes of a more holistic nature by means of factor analysis, which will be demonstrated in the Application Examples section of this paper.

Normally, a laborious "smoothing out" procedure should be performed on the textual data prior to using CATPAC; however, the necessary changes concern only the meaningful words, or image variables. The issues of content analysis that can undermine research results are illustrated using the example of the Stepchenkova and Morrison (2005) study of Russia's induced image. First, proper names can be misspelled in a number of ways, e.g., Saint-Petersburg versus Sankt-Petersburg. Even within a single response spelling may be inconsistent, and the issue is further amplified across all textual files. Every misspelling is counted as a different word by CATPAC and therefore distorts the real picture of image variable frequencies. Another issue that can compromise results is multi-word concepts; e.g., the phrase "Peter the Great" would be broken down by CATPAC into single words that would then be counted separately. In the final CATPAC output, it would be difficult to distinguish how many times the word "great" referred to Peter the Great and how often it was used with the meaning of "magnificent" or "splendid;" therefore, it is necessary to change multi-word concepts into a one-word format. Furthermore, to reinforce concepts, the researcher might wish to count synonyms, e.g., "monastery", "cloister", "convent", and "abbey", as one word, and to change nouns in the plural into the singular form.
Lastly, in open-ended responses, negatives are widely used to express a concept, e.g., "People are not particularly friendly" or "I do not feel safe there," which also needs to be taken into account while analyzing the textual data. All these issues are dealt with in the same manner by the proposed two-step approach. WORDER has a built-in function that allows changes to be made in the data, by means of the above-mentioned input table, simultaneously with the counting process. Image variables identified by CATPAC and their synonyms are placed in the input table row by row. WORDER replaces the synonyms with the corresponding image variable placed in the leftmost cell of each row, and counts the number of times the variable occurs in the textual file. WORDER scans any given textual file as many times as there are rows in the table. Using the example provided by Table 1, during the first scan, whenever WORDER encounters the words "performances," "concert," "concerts," "show," and "shows," it replaces them with the word "performance" (capitalization does not matter) and counts them as such. On the second scan, it does the same with the image variable indicated by the second row, and so on until the entire table is exhausted. As a result, a new, "smoothed-out" file is created for every scanned file, and no changes are made by WORDER to the original data. Depending on the task, these smoothed-out files can be input into CATPAC for further analysis, therefore making the process iterative.

Table 1. WORDER Input Table

performance   performances   concert         concerts    show    shows
Peterhof      Peterhoff      Petergof        Petergoff
Redsquare     Red square
unsafe        not safe       not feel safe   afraid      risky
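The substitution-and-count pass described above can be approximated in a few lines of Python. The rows mirror the entries of Table 1, but the function name and the sample response are illustrative assumptions, not part of WORDER itself.

```python
# A rough sketch of WORDER's "smoothing" pass: for each row of the input
# table, replace every synonym with the row's leftmost image variable,
# then count that variable. Row contents follow Table 1; names are illustrative.
import re

INPUT_TABLE = [
    ["performance", "performances", "concert", "concerts", "show", "shows"],
    ["Peterhof", "Peterhoff", "Petergof", "Petergoff"],
    ["Redsquare", "Red square"],
    ["unsafe", "not safe", "not feel safe", "afraid", "risky"],
]

def smooth_and_count(text):
    """Return the smoothed text and the per-variable counts."""
    counts = {}
    for row in INPUT_TABLE:
        variable, synonyms = row[0], row[1:]
        # Replace longer synonyms first so "not feel safe" wins over "not safe".
        for syn in sorted(synonyms, key=len, reverse=True):
            text = re.sub(re.escape(syn), variable, text, flags=re.IGNORECASE)
        counts[variable] = len(re.findall(re.escape(variable), text, flags=re.IGNORECASE))
    return text, counts

smoothed, counts = smooth_and_count("Shows in Peterhoff were great, but I did not feel safe.")
# counts: {"performance": 1, "Peterhof": 1, "Redsquare": 0, "unsafe": 1}
```

As in WORDER, the original string is left untouched and a new "smoothed-out" version is produced; a real implementation would also need word-boundary checks so that, e.g., "show" is not matched inside an unrelated longer word.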

However, while the described "smoothing" function is a very convenient element of the WORDER software, it is not the answer to all issues associated with the data clean-up process. In particular, it offers no solution to the problem of homographs, i.e., words with more than one meaning. If the identified image variables contain possible homographs, the only way to determine the intended meaning is to scan the original data for all occurrences of the word (Insch and Moore, 1997) with some kind of search function, e.g., the one provided by MS Word. Negatives are also difficult to deal with, because the negation and the actual word can be separated by a large number of other words, e.g., "I don't think that I would feel safe there." "Negative" image variables, i.e., "unsafe," "unfriendly," etc., should be carefully considered, and special care given to constructing the corresponding entries in the WORDER input table.

APPLICATION EXAMPLES

Example 1. Russia's stereotypical holistic images

Let us illustrate the proposed methodology with the study of Russia's destination image (Stepchenkova, 2005), which followed the conceptual framework suggested by Echtner and Ritchie (1993). To gain insights into Russia's stereotypical holistic images, 317 textual responses obtained in the Russia's Destination Image online survey (Stepchenkova, 2005) as answers to Echtner and Ritchie's question "What images or characteristics come to mind when you think of Russia as a travel destination?" were analyzed. Following the two-step approach, a list of the 72 most frequent meaningful words was obtained using CATPAC; all frequencies were 5 or higher. Some words, e.g., "history", "historic", "historical" or "large", "big", were grouped together under the most frequent name, in these cases "history" and "large", to reinforce concepts, and the substitutions in the data were made by WORDER. This reduced the list to 45 stereotypical image variables.
Second, the frequencies of every stereotypical image variable were counted in every response using WORDER, and the results were transferred into an SPSS database. Table 2 contains the overall frequencies of Russia's stereotypical image variables.

Table 2. Russia's Stereotypical Image Variables

Variable         Freq   Variable      Freq   Variable      Freq   Variable   Freq
cold             69     Kremlin       24     food          12     orthodox   7
beautiful        55     palaces       23     culture       12     open       7
people           54     weather       19     friendly      12     vodka      6
history          45     museums       19     domes         10     exotic     6
buildings        39     churches      19     countryside   10     sites      6
poor             38     cities        18     snow          9      Volga      5
architecture     37     large         15     Hermitage     9      river      5
Red Square       36     interesting   13     music         9      spaces     5
St. Petersburg   34     onion         13     winter        9      ballet     5
Moscow           30     art           13     dark          8
country          28     great         12     different     8
old              25     vast          12     places        7

The next step was to reduce the number of stereotypical image variables to a smaller number of image concepts by means of factor analysis. The data matrix obtained by WORDER had 317 cases and 45 variables, which gave a solid case-to-variable ratio of 7.04. The correlation matrix was found to be factorable, with Bartlett's Test of Sphericity significant at the p < 0.0001 level and a KMO statistic of sampling adequacy of 0.529. Principal Components Analysis with Varimax rotation was used. Since the textual responses were generally very short, e.g., "Cold. Beautiful churches," it was decided to look for stable word combinations, which might include as few as 2 words, rather than for full 3-5 word factors. Therefore, the number of factors was not specified, and the option "eigenvalues larger than 1" was chosen. Weak items ("dark", "interesting", and "exotic") with low coefficients in the diagonal of the anti-image matrix (