Part Three. Quantitative Data.

Introduction

In this session we'll be looking at the techniques used to carry out corpus analysis. We'll re-examine Chomsky's argument that corpus linguistics will result in skewed data, and see the procedures used to ensure that a representative sample is obtained. We'll also be looking at the relationship between quantitative and qualitative research. Although the majority of this session is concerned with statistical procedures, which can be said to be quantitative, it is important not to overlook the value of qualitative analyses.

Two points should be made about the statistical part of this session.

  • First, that this section is of necessity incomplete. Space precludes the coverage of all of the techniques which can be used on corpus data.
  • Second, we do not aim here to provide a "step-by-step" guide to statistics. Many of the techniques used are very complex, and to explain the mathematics in full would require a separate session for each one. Other books, notably Language and Computers and Statistics for Corpus Linguistics (Oakes, M., forthcoming), present these methods in more detail than we can give here.

Tony McEnery, Andrew Wilson, Paul Baker.


Qualitative vs Quantitative analysis

Corpus analysis can be broadly categorised as consisting of qualitative and quantitative analysis. In this section we'll look at both types and see the pros and cons associated with each. You should bear in mind that these two types of data analysis form different, but not necessarily incompatible, perspectives on corpus data.

Qualitative analysis: Richness and Precision.

The aim of qualitative analysis is a complete, detailed description. No attempt is made to assign frequencies to the linguistic features which are identified in the data, and rare phenomena receive (or should receive) the same amount of attention as more frequent phenomena. Qualitative analysis allows for fine distinctions to be drawn because it is not necessary to shoehorn the data into a finite number of classifications. Ambiguities, which are inherent in human language, can be recognised in the analysis. For example, the word "red" could be used in a corpus to signify the colour red, or as a political categorisation (e.g. socialism or communism). In a qualitative analysis both senses of red in the phrase "the red flag" could be recognised.

The main disadvantage of qualitative approaches to corpus analysis is that their findings cannot be extended to wider populations with the same degree of certainty that quantitative analyses can, because the findings of the research are not tested to discover whether they are statistically significant or merely due to chance.

Quantitative analysis: Statistically reliable and generalisable results.

In quantitative research we classify features, count them, and even construct more complex statistical models in an attempt to explain what is observed. Findings can be generalised to a larger population, and direct comparisons can be made between two corpora, so long as valid sampling and significance techniques have been used. Thus, quantitative analysis allows us to discover which phenomena are likely to be genuine reflections of the behaviour of a language or variety, and which are merely chance occurrences. The more basic task of just looking at a single language variety allows one to get a precise picture of the frequency and rarity of particular phenomena, and thus their relative normality or abnormality.

However, the picture of the data which emerges from quantitative analysis is less rich than that obtained from qualitative analysis. For statistical purposes, classifications have to be of the hard-and-fast (so-called "Aristotelian") type: an item either belongs to class x or it doesn't. So in the above example about the phrase "the red flag" we would have to decide whether to classify "red" as "politics" or "colour". Yet many linguistic terms and phenomena do not belong to simple, single categories: rather they are more consistent with the recent notion of "fuzzy sets", as in the red example. Quantitative analysis is therefore an idealisation of the data in some cases. It also tends to sideline rare occurrences. To ensure that certain statistical tests (such as chi-squared) provide reliable results, minimum frequencies must be obtained - meaning that categories may have to be collapsed into one another, resulting in a loss of data richness.

A recent trend

From this brief discussion it can be appreciated that both qualitative and quantitative analyses have something to contribute to corpus study. There has been a recent move in social science towards multi-method approaches which tend to reject the narrow analytical paradigms in favour of the breadth of information which the use of more than one method may provide. In any case, as Schmied (1993) notes, a stage of qualitative research is often a precursor for quantitative analysis, since before linguistic phenomena can be classified and counted, the categories for classification must first be identified. Schmied demonstrates that corpus linguistics could benefit as much as any field from multi-method research.


Corpus Representativeness

As we saw in Session One, Chomsky criticised corpus data as being only a small sample of a large and potentially infinite population, arguing that it would therefore be skewed and hence unrepresentative of the population as a whole. This is a valid criticism, and it applies not just to corpus linguistics but to any form of scientific investigation which is based on sampling. However, the picture is not as drastic as it first appears, as there are many safeguards which may be applied in sampling to ensure maximum representativeness.

First, it must be noted that at the time of Chomsky's criticisms, corpus collection and analysis was a long and painstaking task, carried out by hand, with the result that the finished corpus had to be of a manageable size for hand analysis. Although size is not a guarantee of representativeness, it does enter significantly into the factors which must be considered in the production of a maximally representative corpus. Thus, Chomsky's criticisms were at least partly true of those early corpora. However, today we have powerful computers which can store and manipulate many millions of words, and the issue of size is no longer the problem that it used to be.

Random sampling techniques are standard to many areas of science and social science, and these same techniques are also used in corpus building. But there are additional caveats which the corpus builder must be aware of.

Biber (1993) emphasises that we need to define as clearly as possible the limits of the population which we wish to study before we can define sampling procedures for it. This means that we must rigorously define our sampling frame - the entire population of texts from which we take our samples. One way to do this is to use a comprehensive bibliographical index; this was the approach taken for the Lancaster-Oslo/Bergen corpus, whose builders used the British National Bibliography and Willing's Press Guide as their indices. Another approach is to define the sampling frame as all the books and periodicals in a particular library which refer to your particular area of interest - for example, all the German-language books in Lancaster University library that were published in 1993. This is the approach that was used in building the Brown corpus.

Read about a different kind of approach, which was used in collecting the spoken parts of the British National Corpus, in Corpus Linguistics, chapter 3, page 65.

Biber (1993) also points out the advantage of determining beforehand the hierarchical structure (or strata) of the population. This means defining the different genres, channels, etc. that it is made up of. For example, written German could be made up of genres such as:

  • newspaper reporting
  • romantic fiction
  • legal statutes
  • scientific writing
  • poetry
  • and so on....

Stratificational sampling is never less representative than pure probabilistic sampling, and is often more representative, as it allows each individual stratum to be subjected to probabilistic sampling. However, these strata (like corpus annotation) are an act of interpretation on the part of the corpus builder, and others may argue that genres are not naturally inherent within a language. Genre groupings have a lot to do with the theoretical perspective of the linguist who carries out the stratification.
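
To make the idea concrete, here is a minimal sketch in Python of proportional random sampling from each stratum of a sampling frame; the genre labels, text identifiers and target corpus size are all invented for the illustration.

import random

# Hypothetical sampling frame: each genre (stratum) maps to a list of
# candidate text identifiers.
frame = {
    "newspaper reporting": [f"news_{i}" for i in range(2000)],
    "romantic fiction":    [f"fict_{i}" for i in range(800)],
    "legal statutes":      [f"law_{i}" for i in range(300)],
    "scientific writing":  [f"sci_{i}" for i in range(900)],
}

target_size = 200    # invented target number of texts for the corpus
frame_size = sum(len(texts) for texts in frame.values())

corpus_sample = []
for genre, texts in frame.items():
    # Each stratum is sampled probabilistically, in proportion to its
    # share of the whole sampling frame.
    k = round(target_size * len(texts) / frame_size)
    corpus_sample.extend(random.sample(texts, k))

print(len(corpus_sample), "texts selected")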

Read about optimal sample lengths and numbers of samples, and the problems of using standard statistical equations to determine these figures, in Corpus Linguistics, chapter 3, page 66.




Frequency Counts

This is the most straightforward approach to working with quantitative data. Items are classified according to a particular scheme, and an arithmetical count is made of the number of items (or tokens) within the text which belong to each classification (or type) in the scheme.

For instance, we might set up a classification scheme to look at the frequency of the four major parts of speech: noun, verb, adjective and adverb. These four classes would constitute our types. Another example involves the simple one-to-one mapping of form onto classification. In other words, we count the number of times each word form appears in the corpus, resulting in a list which might look something like:


abandon: 5
abandoned: 3
abandons: 2
ability: 5
able: 28
about: 128
etc.....

More often, however, the use of a classification scheme implies a deliberate act of categorisation on the part of the investigator. Even in the case of word frequency analysis, variant forms of the same lexeme may be lemmatised before a frequency count is made. For instance, in the example above, abandon, abandons and abandoned might all be classed as the lexeme ABANDON. Very often the classification scheme used will correspond to the type of linguistic annotation which will have already been introduced into the corpus at some earlier stage (see Session 2). An example of this might be an analysis of the incidence of different parts of speech in a corpus which had already been part-of-speech tagged.
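
As an illustration of both approaches, here is a minimal sketch in Python of a word-frequency count with optional lemmatisation; the sentence and the small lemma table are invented, and a real study would use a full lemma list or a tagger.

from collections import Counter

text = "She abandoned the car because the car had been abandoned before"
tokens = text.lower().split()

# Raw word-form frequencies: each distinct form counts as a separate type.
form_counts = Counter(tokens)

# Lemmatised frequencies: variant forms are mapped onto a single lexeme
# before counting (only a toy lemma table is used here).
lemma_table = {"abandon": "ABANDON", "abandons": "ABANDON", "abandoned": "ABANDON"}
lemma_counts = Counter(lemma_table.get(t, t) for t in tokens)

print(form_counts.most_common(3))
print(lemma_counts["ABANDON"])     # 2 in this toy sentence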



Working with Proportions

Frequency counts are useful, but they have certain disadvantages when one wishes to compare one data set with another - for example, a corpus of spoken language with a corpus of written language. Frequency counts simply give the number of occurrences of each type; they do not indicate the prevalence of a type in terms of a proportion of the total number of tokens in the text. This is not a problem when the two corpora being compared are of the same size, but when they are of different sizes raw frequency counts are of little use on their own. The following example compares two such corpora, looking at the frequency of the word boot:

Type of corpus      Number of words      Number of instances of boot
English Spoken      50,000               50
English Written     500,000              500

A brief look at the table seems to show that boot is more frequent in written than in spoken English. However, if we calculate the frequency of occurrence of boot as a percentage of the total number of tokens in each corpus (the total size of the corpus) we get:

spoken English: 50/50,000 X 100 = 0.1%
written English: 500/500,000 X 100 = 0.1%

Looking at these figures it can be seen that the frequency of boot in our made-up example is the same (0.1%) for both the written and spoken corpora.

Even where disparity of size is not an issue, it is often better to use proportional statistics to present frequencies, since most people find them easier to understand than comparing raw counts out of awkward totals such as 53,000. The most basic way to calculate the ratio between the size of the sample and the number of occurrences of the type under investigation is:

ratio = number of occurrences of the type / number of tokens in the entire sample

This result can be expressed as a fraction, or more commonly as a decimal. However, if that results in an unwieldy-looking small number (in the above example it would be 0.001), the ratio can be multiplied by 100 and expressed as a percentage.
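
A minimal sketch of this calculation in Python, using the invented boot figures from the example above.

def proportion(occurrences, corpus_size, as_percent=True):
    # Ratio of the number of occurrences of a type to the total
    # number of tokens in the sample.
    ratio = occurrences / corpus_size
    return ratio * 100 if as_percent else ratio

print(proportion(50, 50_000))      # spoken corpus:  0.1 (per cent)
print(proportion(500, 500_000))    # written corpus: 0.1 (per cent)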


Significance Testing

Significance tests allow us to determine whether or not a finding is the result of a genuine difference between two (or more) items, or whether it is just due to chance. For example, suppose we are examining the Latin versions of the Gospel of Matthew and the Gospel of John and we are looking at how third person singular speech is represented. Specifically we want to compare how often the present tense form of the verb "to say" is used ("dicit") with how often the perfect form of the verb is used ("dixit"). A simple count of the two verb forms in each text produces the following results:

Text        No. of occurrences of dicit     No. of occurrences of dixit
Matthew     46                              107
John        118                             119

From these figures it looks as if John uses the present form ("dicit") proportionally more often than Matthew does, but to be more certain that this is not just due to coincidence, we need to perform a further calculation - a significance test.

There are several types of significance test available to the corpus linguist: the chi-squared test, the [Student's] t-test, Wilcoxon's rank sum test and so on. Here we will only examine the chi-squared test, as it is the most commonly used significance test in corpus linguistics. This is a non-parametric test which is easy to calculate, even without a computer statistics package, and can be used with data in 2 x 2 tables, such as the example above. However, it should be noted that the chi-squared test is unreliable where very small frequencies are involved and should not be used in such cases. Also, proportional data (percentages etc.) cannot be used with the chi-squared test; it requires the raw frequencies.

The test compares the actual frequencies in the data (the observed frequencies) with those which one would expect if no factor other than chance had been operating (the expected frequencies). The closer these two sets of figures are to each other, the greater the probability that the observed frequencies are the result of chance alone.

Having calculated the chi-squared value (we will omit this here and assume it has been done with a computer statistical package) we must look in a set of statistical tables to see how significant our chi-squared value is (usually this is also carried out automatically by computer). We also need one further value - the number of degrees of freedom which is simply:

(number of columns in the frequency table - 1) x (number of rows in the frequency table - 1)
In the example above this is equal to (2-1) x (2-1) = 1.

We then look along the row of the chi-square table for the relevant number of degrees of freedom until we find the chi-square value nearest to the one we calculated, and read off the probability value for that column. The closer the probability value is to 0, the more significant the difference is - i.e. the less likely it is to be due to chance alone. A value close to 1 means that the difference is almost certainly due to chance. In practice it is normal to assign a cut-off point which marks the difference between a significant result and an "insignificant" result. This is usually taken to be 0.05 (probability values of less than 0.05 are written as "p < 0.05").
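
In practice both the table lookup and the probability value are usually obtained by software. A minimal sketch using the chi-squared distribution in scipy (assuming scipy is available):

from scipy.stats import chi2

df = 1                                 # degrees of freedom: (2 - 1) x (2 - 1)

# Critical value: the chi-squared value that must be exceeded for p < 0.05.
print(round(chi2.ppf(0.95, df), 2))    # 3.84, as in the table below

# Exact probability of obtaining a chi-squared value of 14.843 by chance alone.
print(chi2.sf(14.843, df))             # well below 0.001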

In our example about the use of dicit and dixit above, we calculate a chi-squared value of 14.843. The table below shows the critical chi-square values for the first 3 degrees of freedom at three commonly used significance levels (p values):

Degrees of Freedom      p = 0.05      p = 0.01      p = 0.001
1                       3.84          6.63          10.83
2                       5.99          9.21          13.82
3                       7.81          11.34         16.27

The number of degrees of freedom in our example is 1, and our result of 14.843 is higher than 10.83 (see the final column in the table), so the probability value for this chi-square value is less than 0.001. Thus, the difference between Matthew and John can be said to be significant at p < 0.001.
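
The statistic itself can also be computed directly from the observed 2 x 2 table. The plain-Python sketch below reproduces the 14.843 figure; note that a ready-made routine such as scipy.stats.chi2_contingency returns a slightly lower value for this table, because it applies Yates' continuity correction to 2 x 2 tables by default.

observed = [[46, 107],     # Matthew: dicit, dixit
            [118, 119]]    # John:    dicit, dixit

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi_squared = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # Expected frequency for this cell if only chance were operating.
        expected = row_totals[i] * col_totals[j] / n
        chi_squared += (obs - expected) ** 2 / expected

print(round(chi_squared, 3))   # 14.843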

In depth: You can also read about Type I and Type II errors in the glossary.



Collocations

The idea of collocations is an important one in many areas of linguistics. Kjellmer (1991) has argued that our mental lexicon is made up not only of single words, but also of larger phraseological units, both fixed and more variable. Information about collocations is important for dictionary writing, natural language processing and language teaching. However, it is not easy to determine which co-occurrences are significant collocations, especially if one is not a native speaker of a language or language variety.

Given a text corpus, it is possible to determine empirically which pairs of words have a substantial amount of "glue" between them. Two of the most commonly encountered formulae are mutual information and the Z-score. Both tests provide similar data, comparing the probability that two words occur together as a joint event (i.e. because they belong together) with the probability that their co-occurrence is simply the result of chance. For example, the words riding and boots may occur as a joint event by reason of their belonging to the same multiword unit (riding boots), while the words formula and borrowed may simply occur together as a one-off juxtaposition with no special relationship. For each pair of words a score is given - the higher the score, the greater the degree of collocability.
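
As an illustration, here is a minimal sketch of a (pointwise) mutual information score computed from raw word and bigram counts; the ten-word "corpus" is invented purely so that the example runs, and real scores would of course be computed over much larger amounts of text.

import math
from collections import Counter

tokens = "he bought new riding boots and old riding boots yesterday".split()

word_freq = Counter(tokens)
bigram_freq = Counter(zip(tokens, tokens[1:]))
n = len(tokens)

def mutual_information(w1, w2):
    # Compare the observed probability of the pair occurring together
    # with the probability expected if the two words were independent.
    p_pair = bigram_freq[(w1, w2)] / (n - 1)
    return math.log2(p_pair / ((word_freq[w1] / n) * (word_freq[w2] / n)))

print(round(mutual_information("riding", "boots"), 2))   # a high positive score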

Mutual information and the Z-score are useful in the following ways:

  • They enable us to extract multiword units from corpus data, which can be used in lexicography and particularly specialist technical translation.

  • We can group similar collocates of words together to help to identify different senses of the word. For example, bank might collocate with words such as river, indicating the landscape sense of the word, and with words like investment indicating the financial use of the word.

  • We can discriminate the differences in usage between words which are similar. For example, Church et al. (1991) looked at collocations of strong and powerful in a corpus of press reports. Although these two words have similar meanings, their mutual information scores for associations with other words revealed interesting differences. Strong collocated with northerly, showings, believer, currents, supporter and odor, while powerful collocated with words such as tool, minority, neighbour, symbol, figure, weapon and post. Such information about the delicate differences in collocation between the two words has a potentially important role, for example in helping students learning English as a foreign language.

Read about the use of mutual information in parallel aligned corpora in Corpus Linguistics, Chapter 3, page 73.



Multiple Variables

The tests that we have looked at so far can only pick up differences between particular samples (i.e. texts and corpora) on particular variables (i.e. linguistic features); they cannot provide a picture of the complex interrelationship of similarity and difference between large numbers of samples and large numbers of variables. To perform such comparisons we need to consider multivariate techniques. Those most commonly encountered in linguistic research are:
  • factor analysis
  • principal components analysis
  • multidimensional scaling
  • cluster analysis

The aim of multivariate techniques is to summarise a large set of variables in terms of a smaller set, on the basis of statistical similarities between the original variables, while losing as little information as possible about their differences.

Although we will not attempt to explain the complex mathematics behind these techniques, it is worth taking time to understand the stages by which they work. All the techniques begin with a basic cross-tabulation of the variables and samples.

For factor analysis, an intercorrelation matrix is then calculated from the cross-tabulation and used to "summarise" the similarities between the variables in terms of a smaller number of reference factors which the technique extracts. The hypothesis is that the many variables which appear in the original frequency cross-tabulation are in fact masking a smaller number of underlying variables (the factors) which can better explain why the observed frequency differences occur.

Each variable receives a loading on each of the factors which are extracted, signifying its closeness to that factor. For example, in analysing a set of word frequencies across several texts, one might find that words in a certain conceptual field (e.g. religion) received high loadings on one factor, whereas those in another field (e.g. government) loaded highly on another factor.

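By way of illustration, here is a minimal sketch using scikit-learn's FactorAnalysis class, which fits the model by maximum likelihood rather than from an explicit intercorrelation matrix but likewise yields loadings of each variable on each extracted factor; the frequency data here are randomly generated stand-ins for real corpus counts.

import numpy as np
from sklearn.decomposition import FactorAnalysis

# Hypothetical data: 20 text samples (rows) x 8 word-frequency variables
# (columns), generated at random purely so that the sketch runs.
rng = np.random.default_rng(0)
frequencies = rng.poisson(lam=5.0, size=(20, 8)).astype(float)

fa = FactorAnalysis(n_components=2)
fa.fit(frequencies)

# components_ holds the loadings: one row per extracted factor, one column
# per variable. A high absolute loading ties a variable closely to a factor.
print(fa.components_.shape)    # (2, 8)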

Correspondence analysis is similar to factor analysis, but it differs in the basis of its calculations.

Multidimensional scaling (MDS) also makes use of an intercorrelation matrix, which is then converted to a matrix in which the correlation coefficients are replaced with rank order values: the highest correlation receives a rank of 1, the next highest a rank of 2, and so on. MDS then attempts to plot and arrange the variables so that more closely related items appear closer together than less closely related items.

Cluster analysis involves assembling the variables into unique groups or "clusters" of similar items. A matrix is created, in a similar fashion to factor analysis (although this may be a distance matrix showing the degree of difference rather than similarity between the pairs of variables in the cross-tabulation). The matrix is then used to group the variables contained within it.
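
A minimal sketch of hierarchical cluster analysis with scipy, again using randomly generated frequencies in place of real corpus data; the Euclidean distance measure and average linkage chosen here are only one of several reasonable options.

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Hypothetical data: 10 variables (rows, e.g. words) x 6 text samples (columns).
rng = np.random.default_rng(1)
frequencies = rng.poisson(lam=4.0, size=(10, 6)).astype(float)

# Distance matrix between every pair of variables, then hierarchical
# clustering: items separated by small distances fall into the same cluster.
distances = pdist(frequencies, metric="euclidean")
tree = linkage(distances, method="average")
clusters = fcluster(tree, t=3, criterion="maxclust")

print(clusters)    # a cluster label for each of the 10 variables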

Read more about cluster analysis in Corpus Linguistics, Chapter 3, pages 76, 78 and 79.




Log-linear Models

Here we will consider a different technique which deals with the interrelationships of several variables. As linguists, we often want to go beyond the simple description of a phenomenon and explain what it is that causes the data to behave in a particular way. A log-linear analysis allows us to take a standard frequency cross-tabulation and find out which variables seem statistically most likely to be responsible for a particular effect.

For example, let us imagine that we are interested in the factors which influence whether the word for is present or omitted in phrases of duration such as She studied [for] three years in Munich. We may hypothesise several factors which could have an effect on this, e.g. the text genre, the semantic category of the main verb, and whether or not the verb is separated from the phrase of duration by an adverb. Any one of these factors might be solely responsible for the omission of for, a combination of factors might be responsible, or all the factors working together could account for its presence or omission. A log-linear analysis provides us with a number of models which take these possibilities into account.

The way that we test the models in log-linear analysis is first to test the significance of associations in the most complex model - the model which assumes that all of the variables are working together. We then remove one variable at a time and see whether significance is maintained in each case, until we reach the model with the fewest possible dimensions. So in the above example, we would start with a model that posited all three variables (genre, verb class and adverb separation) and test its significance. Then we would test each of the two-variable models (removing one variable in each case) and finally each of the three one-variable models. The best model is taken to be the one with the fewest variables which still retains statistical significance.
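
As a rough illustration of this kind of model comparison, here is a sketch that fits log-linear models as Poisson regressions with statsmodels and tests a single step of the procedure (dropping the three-way interaction from the saturated model); the cell counts and factor levels are invented purely for the example, and a full analysis would work down through the two- and one-variable models in the same way.

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.stats import chi2

# Invented cross-tabulation: counts of duration phrases with and without
# "for", broken down by text genre and by adverb separation.
data = pd.DataFrame({
    "genre":   ["press"] * 4 + ["fiction"] * 4,
    "adverb":  ["yes", "yes", "no", "no"] * 2,
    "for_use": ["present", "omitted"] * 4,
    "count":   [30, 12, 55, 8, 22, 25, 40, 18],
})

# Saturated model: all three variables and all of their interactions.
saturated = smf.glm("count ~ genre * adverb * for_use",
                    data=data, family=sm.families.Poisson()).fit()

# Reduced model: main effects and two-way interactions only.
reduced = smf.glm("count ~ (genre + adverb + for_use) ** 2",
                  data=data, family=sm.families.Poisson()).fit()

# Likelihood-ratio test: is the simpler model significantly worse?
lr_statistic = reduced.deviance - saturated.deviance
degrees = reduced.df_resid - saturated.df_resid
print("p =", chi2.sf(lr_statistic, degrees))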

Read about variable rule analysis and probabilistic language modelling in Corpus Linguistics, Chapter 3, pages 83-84.


Conclusion

In this section we have -

  • Discussed the roles of qualitative and quantitative analysis

  • Examined the notion of a representative corpus

  • Looked at frequency counts and the importance of proportionally representative data

  • Considered statistical significance testing and looked at a number of further statistical techniques that can be applied to corpora, namely collocation measures, multivariate techniques such as factor analysis, and log-linear models.


