Chapter 5      The Methodological Background:

                      British Traditions of Text Analysis,

                           Correlative Register Analysis and

                           Corpus Linguistics


5.1 Introduction


In looking at both Business English and lexis in the previous chapters, the methodological base on which this research stands has been referred to, but so far has not been fully laid out. One problem with a work of this nature is that it does not fit neatly into one area - in examining Business English the fields of corpus linguistics, lexis, collocation, colligation, multi-word items, and studies into register, discourse and genre analysis are all touched upon. Yet despite the apparent diversity of the areas of study in this thesis, certain underlying methodological principles are at work throughout. This thesis is firmly embedded in British traditions of text analysis as set out by Stubbs (1993, 1996),[1] following along the lines of J.R. Firth, M.A.K. Halliday and John Sinclair in particular. This chapter, consequently, is divided into two parts.


·    Firstly, this chapter will look at how this thesis is placed in the context of British text linguistics following the main principles laid out by Stubbs (1993, 1996). Stubbs suggested nine main principles of the British tradition of text analysis, of which eight will be utilised here. In each case Stubbs’ principles will be briefly elucidated, followed by an explanation of how this thesis relates to it. 


·    Secondly, by linking this thesis to principles of British text analysis in the tradition of Firth and especially Sinclair, a further important link is forged: that of the necessity of using corpus-based methodology in order to investigate language. This chapter will also, therefore, lay out the methodological reasoning behind the choice of the use of corpora. This will lead to the next chapter, where the two corpora created for this study - the Business English Corpus and the Published Materials Corpus - will be presented, in relation to key issues of corpus creation.


5.2 British traditions in text analysis: Firth, Halliday and Sinclair


The article by Stubbs, on which this part of the chapter is based, mostly covers British traditions in text analysis from Firth onwards. While there are references to the early work of Firth in the 1930s, Stubbs’ main focus is a contrast between the work of British linguists in the tradition of Firth and the conceptualisation of language put forward by Chomsky. Although Halliday and Sinclair were working within the same time frame as Chomsky - the 1960s to the 1990s - they represent a very different view of language and language research, and these differences are now discussed in reference to the nine points made by Stubbs.


5.2.1 Principle 1:  Linguistics is essentially a social science and an applied science


The first difference between the Chomskyan and British schools of thought discussed by Stubbs is related to views on linguistics itself. Chomsky saw linguistics as a branch of cognitive psychology, whereas Firth, Halliday and Sinclair saw it as an applied social science (Stubbs 1993:3). Thus for the British school, linguistics should be seen in a social context, and although this view holds that ‘social scientific study need not have any practical applications’ (Stubbs 1993:4), in practice much of the British work has had an applied element. Stubbs notes that ‘Firth describes his work as essentially sociological’ (1993:4), and the work of Sinclair - notably the COBUILD project - has led to pedagogical grammars, dictionaries and teaching materials. Stubbs notes that Halliday, in his work on register, has likewise formulated an essentially social definition of it. Language study, then, in effect, should be related to applied issues and not be seen as ‘work divorced from all social relevance’ (1993:4).


The view taken in this thesis is that research should have a direct applied element and that the results of the research should be able to be utilised directly in the classroom. The focus in this research on the key lexical items of Business English aims at creating a core lexis of Business English that can serve as a bank of information for teachers and students alike. This information is stored electronically, which additionally facilitates easy access and retrieval. As a result of this easy access, an easier transfer of data to teaching materials can also be achieved. The use of semantic prosody as an organising element for collocations and multi-word items adds a further educational element to the study - a framework is provided for students by which the complexities of collocation can be organised. Examples of teaching materials already created from the corpus can be found in Appendix 11 in Vol. II, p.891.


5.2.2 Principle 2: Language should be studied in actual, attested, authentic instances of use, not as intuitive, invented, isolated sentences


This second principle is clearly an attack on the Chomskyan rationalist view of language, in which the focus of study was ‘intuitive, invented, isolated sentences’. For Chomsky, isolated invented sentences were studied and advanced as credible data in support of his theories of language. These sentences were created by the researcher and did not come from any objective data. The Firthian tradition takes the opposing view. For Firth, language could not be studied as isolated sentences and was seen as contextual. He noted in 1957 that ‘The text is the focus of attention ... is regarded as an integral part of the context, and is observed in relation to the other parts regarded as relevant in the statement of the context’ (Firth 1957:175-176). He continued, ‘The placing of a text as a constituent in a context of situation contributes to the statement of meaning since situations are set up to recognize use’ (1957:179). Thus, for Firth, language derives its meaning from context and cannot be seen outside it. Consequently, as language is contextual, on this view Chomsky’s methods are invalidated.


The use of introspection and intuitive data - a key factor in Chomsky’s approach to linguistics - is strongly criticised by Stubbs: ‘One does not expect a scientist to make up the data at the same time as the theory, or even to make up the data afterwards, in order to illustrate the theory’ (Stubbs 1996:29). Stubbs’ views are echoed by Sinclair: ‘human intuition about language is highly specific, and not at all a good guide to what actually happens when the same people actually use the language’ (Sinclair 1991:4). More bitingly, perhaps, he commented that ‘One does not study all of botany by making artificial flowers’ (1991:6).


If intuitive data is invalid, data must, therefore, come from an outside source. This outside source, Stubbs suggests, should be a large corpus of language: ‘A large corpus, consisting of at least several million words, searched with computer assistance, provides a way out of this dilemma’ (Stubbs 1996:32).


This thesis adopts the views of Sinclair and Stubbs on the use of authentic data over intuitive and introspective data. The two corpora created for this study, though not large,[2] provide all the data for analysis - no data is of an introspective nature. Intuition has been used only where categorisation of language is necessary and empirical methods are not available.


5.2.3 Principle 3: The unit of study must be whole texts


In terms of corpus creation there have been two basic methods for determining corpus content. The first is to take, for example, 2,000-5,000 word extracts from a variety of pre-determined texts in the hope of providing a ‘balanced’ sample. The second is to use whole texts. Early corpora of limited size used the former method, but with the rise of larger corpora the focus has shifted to using full texts. The choice of full texts also has a significant impact on the kind of data that can be studied. Stubbs notes that ‘few linguistic features of a text are distributed evenly throughout’ (1996:32), with the result that the use of only a small ‘sample’ of a given text will inevitably miss a great many of the features present. This is especially important when studying genre. Studies into genre have noted how certain linguistic features are typical of certain parts of a text, and an approach to corpus creation that only takes extracts at random will fail to gain a representative sample in this respect:


... a corpus which does not reflect the size and shape of the documents from which it is drawn is in danger of being seen as a collection of fragments where only small-scale patterns are accessible. 

                                                                                    (Sinclair 1991:19)


The idea of using the category of whole text, as opposed to the word, as the starting point for the analysis of language was discussed by Scott (1997). Scott compared this approach with other corpus analytical methods - notably the word-and-collocation-span model (Scott 1997:235). This latter approach starts from an analysis of a node word and centres on the collocation of words within a pre-determined span of it. Scott’s approach here was essentially different: rather than starting from a word and seeing how it behaves, the key word statistic is determined by comparing a whole text to a reference corpus. The software used to perform the analysis - WordSmith - computes key words that occur significantly more often in the whole text under analysis than could be expected on the basis of the distribution of words in the reference corpus. In this way Scott was able to ‘characterise texts, and ... develop means for drawing inferences regarding the culture these texts spring from’ (Scott 1997:235). Tribble (1998) adopted the same method in his genre study of Phare project proposals and notes that this full text approach ‘has immediate and significant advantages for anyone with an interest in genres and the whole area of language in a social context’ (Tribble 1998:5).
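The key word statistic underlying this approach can be sketched in code. WordSmith’s exact keyness measure depends on its settings (chi-square or log-likelihood), so the Python fragment below is only an illustrative sketch using a Dunning-style log-likelihood score; the function name and the toy data are invented for the example:

```python
import math
from collections import Counter

def keyness_ll(word, text_tokens, ref_tokens):
    """Dunning-style log-likelihood keyness of `word` in a text,
    measured against a reference corpus: the higher the score, the
    more unexpected the word's frequency is given the reference."""
    a = Counter(text_tokens)[word]      # observed frequency in the text
    b = Counter(ref_tokens)[word]       # observed frequency in the reference
    c, d = len(text_tokens), len(ref_tokens)
    e1 = c * (a + b) / (c + d)          # expected frequency in the text
    e2 = d * (a + b) / (c + d)          # expected frequency in the reference
    ll = 0.0
    if a:
        ll += a * math.log(a / e1)
    if b:
        ll += b * math.log(b / e2)
    return 2 * ll

# A toy 'text' and 'reference corpus'; real key-word work compares
# full texts against a multi-million-word reference.
text = "profit rose as profit margins improved".split()
ref = "the cat sat on the mat and the dog slept".split()
print(keyness_ll("profit", text, ref))  # 'profit' scores highly: it is key
```

Ranking every word in the text by such a score, and keeping those above a significance threshold, yields the kind of key word list from which Scott characterises texts.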


In line with the above discussion the BEC has been gathered entirely from full texts.[3] The methods used by Scott (1997) - who analysed language in newspaper articles - are used in this thesis to gain key words not just for a given genre, but for Business English as a whole. The full range of lexis and the appropriate distribution of words in Business English could not be achieved without the use of full texts. The key words then provide a platform for more detailed analysis that takes place at the level of the word, using the word/collocation span model for collocational analysis. It is important to stress here, however, that this collocational analysis takes place on words that come from full texts and not isolated samples of language, thus placing them in a full and complete lexical environment.
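The word/collocation-span model used for this more detailed analysis can itself be sketched. The snippet below is a minimal illustration only: the function name and data are invented, and the ±4-word span is an assumption, though it is a conventional choice in Sinclairian collocation studies.

```python
from collections import Counter

def collocates(tokens, node, span=4):
    """Count the words occurring within `span` tokens on either side
    of each occurrence of `node` - the word/collocation-span model."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            # Words before and after the node, within the span.
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

# Toy data; in the thesis the node words come from full corpus texts.
tokens = "we made a profit last year and a loss of profit this year".split()
print(collocates(tokens, "profit").most_common(3))
```

In practice such raw co-occurrence counts are then usually weighted by a significance statistic (MI or t-score, for example) before collocates are ranked.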


5.2.4  Principle 4: Texts and text types must be studied comparatively across text corpora


Firth believed in the heterogeneity of language: that the concepts of unity and language are incompatible and that there is an ‘inherent variability of language’ (Stubbs 1996:33). This variability is shown in the variety of different registers and genres of language. Both Halliday (register analysis) and Sinclair (discourse analysis) have applied this basic standpoint throughout their work. Stubbs notes that certain factors can hide the inherent variance of language - the most important of which is the use of introspective data. This point is thus in harmony with Principle 2, which noted that all data should come from authentic attested situations and not be of the introspective kind.


In discussing the variability of language, Stubbs refers to the work of Biber (1988) (see also, for example, Biber & Finegan 1989, 1994 and Biber 1995). Biber’s work has concentrated on linguistic variation in genres and is thus founded in earlier traditions of statistical correlative register analysis. Biber was able to place a variety of genres on several clines that mark out their linguistically distinguishing features in relation to six ‘text dimensions’. Biber’s work is important in that it was done according to purely quantitative criteria.[4]

This study is grounded both in the tradition of Firthian British text linguistics and also in that of correlative register analysis. It is therefore accepted that a basic property of language is that it varies according to register. At the same time, account is taken of genres in terms of their organising function in the make-up of the BEC. Thus, the assumptions about the nature of language on which this thesis rests are that a) language is inherently diverse according to the purpose[5] and situation in which it is being used, and that b) the only way to investigate this is to gather authentic data from the situations where this language is in actual use. The departure point for language study in this work is the two corpora - the Business English published materials corpus (PMC) and the authentic Business English corpus (BEC). Using these two corpora, the heterogeneity of language is shown firstly by contrasting ‘Business English’ with ‘General English’ to study inherent differences in lexical choice and, secondly, by a contrastive study of ‘real’ Business English and the business language found in Business English teaching materials.


5.2.5 Principle 5: Linguistics is concerned with the study of meaning: form and meaning are inseparable


Stubbs here quotes Chomsky (1957:17): ‘grammar is autonomous and independent of meaning’ (cited in Stubbs 1996:35). This is exactly the opposite of the view expressed by Sinclair (1991), who states that ‘There is ultimately no distinction between form and meaning’ (Sinclair 1991:7), thus expressing views held in the British tradition of text analysis on the interdependency of form and meaning brought about by Firthian definitions of collocation. A large part of the importance of collocational analysis in British corpus linguistics over the last twenty years has developed from the Firthian definition of words being at least partially defined by the other words with which they can collocate (Firth 1951/57). There is thus a ‘syntagmatic link between words as such, not between categories’ (Stubbs 1996:35). This thesis studies the semantic links between words/multi-word items and the other words that surround them in a syntagmatic relationship, and this is also related to form. The syntagmatic relationships formed are then further categorised using the notion of semantic prosody. However, when discussing the relationship of form and meaning, it is not possible to do so fully without reference to the grammar/lexis divide. Stubbs’ next section deals with just this dichotomy, or, in fact, the lack of it.


5.2.6 Principle 6: There is no boundary between lexis and grammar: lexis and grammar are interdependent


Traditionally grammar and lexis have been treated as separate and independent categories. In terms of collocation, we have seen Gitsaki (1996) define three major schools of thought: the lexical composition approach, the semantic approach and the structural approach.[6] The first two approaches, inspired by Firth, saw lexis and grammar as separate, whilst the latter approach saw them as co-joined. Later work by Sinclair (1991), Willis (1993), Hunston et al. (1997), Hunston & Francis (1998), Hoey (1997, 2000) and indeed Stubbs (1993, 1996) sees lexis and grammar as dependent on each other and interrelated. Stubbs elucidates the principle of co-selection: lexis chooses grammar and grammar chooses lexis: ‘What corpus study shows is that lexis and syntax are totally interdependent. Not only different words, but different forms of a single lemma, have different grammatical distributions’ (Stubbs 1996:38). Willis (1993) suggests that rather than seeing grammar and lexis as separate, the starting point should be the ‘word’[7] and that the traditional concepts of grammar should be broadened to consider the grammar of structure, necessary choice, class, collocation and probability (Willis 1993:84-85). Later studies, it has been shown (e.g. Hunston et al. 1997, Hunston & Francis 1998), indeed present an even stronger case for this.[8] Stubbs presents a list of eight central conclusions that can be made about lexico-grammatical relationships. Two key points are given below:


1.  Any grammatical structure restricts the lexis that occurs in it; and conversely any lexical item can be specified in terms of the structures in which it occurs. (Stubbs 1996:40)

2.  Every sense or meaning of a word has its own grammar: each meaning is associated with a distinct formal patterning. Form and meaning are inseparable. (Stubbs 1996:40)


This thesis accepts the principle of lexico-grammatical relationship as presented by Stubbs and the later work of Sinclair. In so doing it utilises the concepts of collocation and colligation to examine the typical lexico-grammatical formation of business lexis and it will be seen in Chapter 9 how lexis and grammar in Business English are intertwined.


5.2.7 Principle 7: Much language use is routine


The basic concept under discussion here is that when speaking we are not free to say what we like, but are bounded by certain possibilities and restrictions over and above the normally recognised grammatical categories. Each spoken or written act to a large extent determines the next one. Thus, language is made up of a large number of lexical items that are repeated over and over again in everyday situations, in terms of individual words, collocations and multi-word items (Peters 1983, Pawley & Syder 1983, Widdowson 1989, Sinclair 1991, Nattinger 1980, 1988, Nattinger & DeCarrico 1992, Lewis 1993, 1997). This is in contrast to Chomsky, who focused on creativity and saw routine as negative. This thesis concentrates on the routine language of business, providing an analysis not only of words, but also of collocates and multi-word items.


5.2.8 Principle 8: Language in use transmits the culture


Stubbs (1996) gives examples of how fixed and semi-fixed phrases are used to encode cultural information, e.g. a soft oriental rhythm came through entrancingly / that orientalized, barbarized nation (Stubbs 1996:169). The transmission of culture through language, though of interest and importance, does not form part of this thesis, so no further discussion is offered here.


5.2.9 Principle 9: Saussurian dualisms are misconceived


It is obvious from what has been written that the ideas of Firth, and later those of Sinclair and Halliday, were in direct opposition to Chomsky’s competence/performance dichotomy and, by implication, also to de Saussure’s langue/parole distinction, i.e. that there is no need for these distinctions at all. It has been seen that whilst not all writers reject Chomsky (Pawley & Syder 1983, Nattinger & DeCarrico 1992), many of those involved in corpus linguistics, notably Sinclair, do. Stubbs notes that


The essential vision underlying corpus linguistics is that computer-assisted analysis of language gives access to data which were previously unobservable, but which can now profoundly change our understanding of language.                                      (Stubbs 1996:45-46)


Sinclair, relying on computer-based methods, used this newly accessible data to put forward the idiom principle (Sinclair 1991) and a damning criticism of Chomsky’s (1957, 1962) ideas. Computer-assisted methodology also enabled Sinclair (1991) and Louw (1993) to formulate the concept of semantic prosody, which had hitherto been unrecognised by rationalist approaches to language study.


This thesis is concerned with data of language in use and uses corpora in order to investigate the language of Business English. Chomsky (1962) rejected the use of corpora and was not interested in language as such - only in idealised notions of grammatical competence detached from real life. Therefore, the Chomskyan notion of competence and performance is of little use in a study of this nature, and instead the open/idiom principle view of language put forward by Sinclair (1991), as laid out in Chapter 4, is adopted. This methodological framework allows for a study of routine language: it uses actual language, is firmly based on authentic situations, and stores its data in a computerised format. This further facilitates an easy transfer of results to the classroom once they have been ascertained.


A further implication of the above is that for language to be studied along the lines of Sinclair’s idiom principle, it can only be done using computerised corpora. This thesis is grounded in two computerised corpora and so the next section, and indeed the next chapter, will consider issues related to corpora in general, and the creation of the two corpora in particular.


5.3 Corpus Linguistics


5.3.1 Corpora: a brief history


The term corpus, coming from the Latin word for ‘body’, was used as early as the 6th century to describe a collection of legal texts, the Corpus Juris Civilis (Francis 1992:17). The term ‘corpus’ has retained this meaning - that of a body of text - but for corpus linguists this definition is not enough. By one of its five OED definitions, a corpus is ‘The body of written or spoken material upon which a linguistic analysis is based’. Thus it cannot be seen as just a collection of texts; it is further ‘a collection of texts assumed to be representative of a given language, dialect, or other subset of a language, to be used for linguistic analysis’ (Francis 1982:7 cited in Francis 1992:17). Likewise, the Collins COBUILD (1995) dictionary defines a corpus as ‘a large collection of written or spoken texts that is used for language research’.


Francis (1992) records three main areas where corpora have historically been used.[9] Corpora have been used mainly in lexicographical studies in the creation of dictionaries, in dialectological studies and in the creation of grammars. Kennedy (1998) adds to this the concordancing work on the Bible carried out by Alexander Cruden in 1736.[10] The image of silver-haired professors straining over mountains of text and manually counting occurrences of linguistic features has been a hard one to dispel.[11] One early example is mentioned by Kennedy (1992:335): the 1897 German corpus of Kaeding was created by five thousand assistants and consisted of 11 million words. This is, of course, small by today’s standards, but represented a massive achievement at the time. The use of corpora in linguistic research was considered perfectly acceptable in the first half of the twentieth century, and the work on vocabulary carried out by Palmer, Thorndike and West, noted in the previous chapter, had corpora of texts underlying its results to a greater or lesser extent.


Yet there is a clear divide between early corpus linguists and their modern-day counterparts. The first reason for this divide is the corpus methodology used. Kennedy (1992) notes several problems with these early corpora: they were mostly of written texts only; just the forms were counted, not the meanings;[12] and the corpora were untagged, so homonyms were often classed as one word. The second and main reason for this divide, however, was Chomsky. Leech (1991) notes of Chomsky that ‘His view on the inadequacy of corpora, and the adequacy of intuition, became the orthodoxy of a succeeding generation of theoretical linguists’ (1991:8). It has already been noted in this thesis that in a number of articles in the late 1950s and 1960s, Chomsky challenged the whole notion of empiricism on which corpus linguistics had been based and suggested instead a rationalist approach.[13] This approach advocated a methodology where ‘rather than try and account for language observationally, one should try to account for language introspectively’ (McEnery & Wilson 1996:6). Chomsky attacked corpus-based studies by saying that ‘Any natural corpus will be skewed ... the corpus, if natural, will be so wildly skewed that the description would be no more than a mere list’ (Chomsky 1962:159 cited in Leech 1991:8). As Chomsky was more interested in competence than performance, corpus linguistics, which was primarily based on actual performance data, seemed to be invalidated overnight. It led to a situation that Sinclair describes:


Starved of adequate data, linguistics languished - indeed it became almost totally introverted. It became fashionable to look inwards to the mind rather than outwards to society. Intuition was the key, and the similarity of language structure to various formal models was emphasised.                                                                  (Sinclair 1991:1)


Work on corpora continued despite these criticisms, however, and, with the advent of the computer, corpora really came into their own.[14] The work begun in the early sixties by Randolph Quirk (the SEU Corpus) and by Francis and Kucera (the Brown Corpus) was capitalised on by Svartvik in creating the London-Lund Corpus (LLC), the first machine-readable corpus of spoken language.[15] By the 1980s corpus linguistics had almost found its way back into mainstream applied linguistics. Leech (1991) distinguishes three generations of corpora, going from the early one million word corpora to the present day, where corpora can be measured in the hundreds of millions of words.[16] This rise of corpus-based research can be seen in the number of corpus-related studies carried out. Svartvik (1992:8) shows that whereas only ten corpus-based studies could be identified before 1965, between 1986 and 1992, 320 were carried out. There is a profound sense amongst corpus linguists these days that the use of corpora has ‘arrived’. Sinclair notes:


Thirty years ago when this research was started it was considered impossible to process texts of several million words in length. Twenty years ago it was considered marginally possible but lunatic. Ten years ago it was considered quite possible but still lunatic. Today it is very popular.                                                                        (Sinclair 1991:1)



The latest trends in corpus linguistics are discussed by Flowerdew (1998), who suggests that whereas earlier corpus studies were concerned with the exploration of linguistic patterns, modern studies are becoming more and more concerned with the exploitation of corpora for pedagogical purposes. She argues that more studies are needed that take into account aspects of discourse and genre and that whilst some work in this direction has been done (see, for example, Tribble 1998), there is still a need for more. Exploitation of corpora now also takes into account the creation of teaching materials (see Wichmann et al. 1997 for discussion on this), and this aspect will be covered in more detail in Chapter 9 of this thesis.


5.3.2 Why use corpora?


When discussing the relationship of this thesis to traditions of British text analysis above, it was stressed that language study must make use of ‘actual, attested, authentic instances of use’. The thoughts of Stubbs and Sinclair were quoted at length to stress the necessity of this line of thinking. Belief in this approach to linguistic analysis leads automatically, therefore, to the use of corpora. The next section of this chapter accordingly examines the reasons why a corpus linguistic approach has been chosen for this work: what follows is a brief discussion of the merits and possible pitfalls of using corpora for linguistic study, as seen in the literature.


5.3.3 Corpora: For and against


It is possible to derive three basic standpoints from the literature with regard to the use of corpora in linguistic analysis. These standpoints can be summarised as a) those strongly for the use of corpora; b) those for corpora, but with certain reservations; and c) those against their use altogether. So much did the climate change in the 1990s that the utility of corpora for language analysis is no longer seriously questioned.[17] The days of the overwhelming influence of Chomsky have gone and the third alternative presented above is no longer tenable. With regard to the first two of the three categories, therefore, Murison-Bowie (1996) presents a neat definition:


The strong case suggests that without a corpus (or corpora) there is no meaningful work to be done. The weak case is that there are additional descriptive pedagogic perspectives facilitated by corpus-based work which improve our knowledge of the language and our ability to use it.                                                                       (Murison-Bowie 1996:182)


Writers tending towards the stronger end of this continuum include Stubbs and Sinclair, but even most pro-corpus writers do not follow the creed of corpus linguistics blindly. They realise that whilst there are considerable advantages to be gained from corpora, there are also possible negative aspects that need to be taken into consideration. The positive and negative aspects of corpora use, therefore, can be presented with reference to the literature as follows.


5.3.4 Reasons for the use of corpora in linguistic analysis


The advantages of corpora use were expounded over thirty years ago by Halliday and Sinclair, in 1966. At that time Halliday suggested the creation of a 20 million-word corpus for collocational analysis (Halliday 1966:159) and Sinclair declared that the problems of lexis ‘are not likely to yield to anything less imposing than a very large computer’ (Sinclair 1966:410). This enthusiasm for corpora has already been documented above, but other writers have presented compelling reasons for the use of corpora. It is important to note here, however, that the use of corpora and the use of computers to analyse them are held to be synonymous. Thus, some of the advantages stated below for the use of corpora are actually advantages brought about by the use of computer technology rather than those of corpora per se. This being said, at least ten main interrelated factors can be found in the literature that confirm the advantages of computerised corpora use in linguistic analysis:


1. Objectivity vs intuition: The notion of a researcher’s intuition as opposed to statistical objectivity is raised once again (as it has been many times throughout this thesis). This point has already been discussed in Principle 2 above so does not need any further mention here, other than to note that the objective power of language corpora is well recognised in the literature (see Point 1 in Table VIII below) and that the advantages brought about by computerised corpora in linguistic research are overwhelming.


2. Verifiability of results: In a related point, both Svartvik (1992) and Biber (1995) emphasise the importance of this factor. For results to have any meaning they must be able to be verified. Svartvik observes that verifiability is one of the main tenets of scientific research and so it should also be one of linguistics. Corpora offer the possibility of verifying results, whereas introspection does not.


3. Broadness of language able to be represented: This factor is widely discussed in the literature (see, for example, Svartvik 1992, Biber 1995, and Biber, Conrad & Reppen 1994). Corpus linguistic methodology facilitates the gathering of samples of different registers and styles of language which are necessary to show the ‘wide repertoire of language’ (Svartvik 1992:9).


4. Access: Once a set of texts has been gathered and placed in a corpus it can be made available to researchers all over the world. Moreover, it provides non-native speakers with the same possibilities of study as native speakers. In the rationalist system, the non-native speaker had been excluded (Svartvik 1992).


5. Broad scope of analysis: Computerised corpus analysis allows a broad battery of statistical tests to be carried out on the data in a matter of seconds.


6. Pedagogic: There are strong pedagogic reasons (face validity, authenticity, motivation, for example) why the results of corpora research should be used in the classroom (Johns 1988, Tribble & Jones 1990, Kennedy 1992, Wichmann et al. 1997 and Flowerdew 1998). This is an area that will be returned to in more detail in the final chapters of this thesis.


7. Possibility of cumulative results: Biber (1995:32) notes that a corpus allows several researchers to work on the same texts. In this way previous work can be verified and the findings of different studies can be compared in a meaningful way.


8. Accountability: The possibility of verification of results leads to the accountability of the researchers. Thus, as in other areas of science, it is possible to replicate work done and hold the results up for comparison.


9. Reliability: The simple fact is that computers are much more reliable analysts of texts than humans. As Biber (1995) says, ‘computers do not become bored or tired’ (1995:32). Additionally, the hard empirical evidence presented by a corpus of authentic texts can provide indisputable evidence, for example of the frequency of particular items, in a way that introspection cannot. This view of research is neatly summed up by Francis & Sinclair when they say that ‘Corpus data provides us with incontrovertible evidence about how people use language’ (Francis & Sinclair 1994:191).


10. View of all language: The combination of computers and corpora allows a new view of language.[18] As Sinclair says, ‘Language looks rather different when you look at a lot of it at once’ (1991:100). Additionally, hitherto unrecognised features of the language can become readily apparent when a large corpus is analysed by computer. The discussions of semantic prosody by Sinclair (1991) and Louw (1993) are a very good case in point: the whole concept was only discovered by reference to a large amount of corpus data.


These points are summarised in Table VIII below, along with the writers who made the points in question.


It can be seen that the reasons for the use of corpora in linguistic research are manifold. As was stated earlier, corpora are no longer derided in linguistic circles as they once were, but some residual hostility towards them remains (Owen 1993, 1996). There is also the danger that corpora come to be seen as an end in themselves. It is important, therefore, to remember that it is the researcher who must do the thinking, not the machine. In addition to outright criticism, several writers who are in fact very pro-corpora also express reservations about their use, and point to possible pitfalls that need to be avoided. These issues are addressed in the next section.






Table VIII: The advantages of computerised corpora in linguistic analysis

1.  Objectivity of results as opposed to subjective intuition: Sinclair (1991), Stubbs (1996), Svartvik (1992), Biber (1995), Biber, Conrad & Reppen (1994)

2.  Verifiability of results: Svartvik (1992), Biber (1995)

3.  Broadness of language able to be represented: Svartvik (1992), Biber (1995), Biber, Conrad & Reppen (1994)

4.  Access: Svartvik (1992)

5.  Broad scope of analysis offered by computerised corpora: Biber (1995), Biber, Conrad & Reppen (1994)

6.  Pedagogic: Johns (1988), Tribble & Jones (1990), Kennedy (1992), Wichmann et al. (eds) (1997), Flowerdew (1998)

7.  Possibility of cumulative results: Biber (1995)

8.  Accountability: Biber (1995)

9.  Reliability: Biber (1995)

10. View of ‘all’ language and new perspectives: Sinclair (1991), Louw (1993)



5.3.5 Some problems with the use of corpora for linguistic analysis


At least four main criticisms of the use of corpora can be found in the literature:


·    The first criticism is that corpora focus only on performance-related issues and cannot analyse those aspects of language that are more concerned with competence (Howarth 1998).

·    Secondly, the pedagogical usefulness of frequency lists generated by corpora, and the value of authentic materials for use in the classroom, has been questioned (Widdowson 1990, Murison-Bowie 1996, Howarth 1998).

·    Thirdly, it has been suggested that intuition has been done away with altogether and that a cult of complete reliance on machines has developed, with the result that common sense has been left behind (Owen 1993).

·    Finally, it has been widely noted that corpora can suffer from problems related to their size, representativeness and balance (Renouf 1987, Sinclair 1991, Hudson 1997, Clear 1997, Tribble 1997, Lewis 1999 - personal communication).


The first three of these points will be dealt with in this section. The last point, concerning size, representativeness and balance, will form the subject of the next chapter, where these issues are discussed in relation to the two corpora created for this thesis.


a) Corpora, competence and performance: The views of Chomsky on corpora have been noted in this thesis several times, so no further mention of them is needed at this point. Recently, however, the competence/performance dichotomy has been revived in relation to the automatic analysis of multi-word items. Howarth (1998) criticises the automatic analysis of corpora as carried out by Sinclair, saying that ‘such automatic analysis focuses on performance and may exclude considerations of competence’ (1998:26). What Howarth is essentially saying is that over-concentration on the surface forms of language readily available in computerised corpora can hide the issues of memory usage and production that must underlie them. The researcher thus needs to consider how multi-word items are processed. Howarth recognises the value of computer corpora but argues that ‘phraseological significance means something more complex and possibly less tangible than what any computer algorithm can reveal’ (1998:27).

In answer to these criticisms it has already been argued (Stubbs 1996, Sinclair 1991) that the competence/performance divide is invalid. There is no denying that it is far easier to count occurrences of words statistically than it is to say why they are there in the first place, or why they occur in the patterns that they do. However, this is not a problem of corpus linguistic methodology per se, but a problem facing all linguistic analysis. Corpora provide the very best sources of primary data, which can then be utilised to perform further analysis. Thus, the latest work using corpora, as reported by Flowerdew (1998), is in fact now delving behind the pure ‘performance’ data that concerns Howarth and looking at language from a discourse and genre-based perspective (Tribble, forthcoming).


b) Frequency and pedagogy: Corpus studies have been criticised on account of their preoccupation with frequency (Murison-Bowie 1996). When using a corpus, the first and most obvious statistics available are usually those of frequency. However, ‘Raw frequency figures for individual word occurrences tell one comparatively little’ (Murison-Bowie 1996:188). Frequency, therefore, does not necessarily mean significance. Howarth concurs with this view in relation to the teaching of collocations, saying that ‘a notion of significance based solely on frequency risks giving unwarranted emphasis to completely transparent collocations such as have children, which may occur frequently ... but are quite unproblematic for processing’ (1998:26-27; Howarth’s italics).


It must be accepted that raw frequency data cannot be the sole criterion by which a vocabulary item is included in teaching materials. It was seen earlier, however, that the frequency of words was long regarded as crucial in the classroom, giving rise to the vocabulary control movement in the first half of the twentieth century. This kind of information should still be considered valid today (Francis & Sinclair 1994:191). Frequency data can be combined with other factors such as range, utility and coverage in order to present students with the most useful language. Additionally, the concept of delexicalised language has shown that the most frequent words, previously ignored by structuralist grammars, are in fact key elements in the generative power of lexis; they therefore need more attention than they have previously been given. Frequency data is the first, but not the last, step in determining what language students should be exposed to. It is, however, an essential first step, and it supersedes previous views on the value of introspection in performing this function.[19] This aspect of corpora and pedagogy will be revisited later in this thesis.
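To make the discussion of raw frequency concrete, a frequency list of the kind referred to above can be produced with only a few lines of code. The sketch below is purely illustrative: the sample sentence and the deliberately simple tokenisation rule are assumptions made for the example, not the procedure used in this thesis. Note how a delexicalised word such as the immediately dominates the list.

```python
import re
from collections import Counter

def word_frequencies(text: str) -> Counter:
    """Count raw word-form frequencies in a text.

    Tokenisation here is an illustrative assumption: lower-cased
    alphabetic strings, with punctuation ignored.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

sample = "The price of the shares rose. The market welcomed the price rise."
freq = word_frequencies(sample)
print(freq.most_common(2))  # [('the', 4), ('price', 2)]
```

As the discussion above stresses, such a list is only a first step: deciding which of these items are pedagogically significant requires the further criteria of range, utility and coverage.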


c) Machines vs intuition: Owen (1993) disparages the creation of a grammar based on corpus evidence. His objections are many, and a discussion of the grammars created from corpora is outside the scope of this thesis. However, he attacks certain aspects of corpus use that are directly relevant to this study. He bases his criticisms of corpora on several grounds[20] and questions the value of computer-aided corpora in general, arguing that ‘total reliance on a corpus does not necessarily yield better observation, and that observation, when achieved, does not automatically equate with better explanation’ (Owen 1993:168). He particularly criticises Sinclair for supposedly excluding intuition altogether as a resource in linguistic study, and concludes by saying that over-reliance on corpus data ‘leads to irrelevance, oversight, and misrepresentation’ (Owen 1993:185).


Sinclair’s reply (Francis & Sinclair 1994) contests Owen’s accusation that intuition has been abandoned. He stresses that intuition was still a part of the COBUILD corpus study, but that it was an intuition based on concrete evidence rather than pure introspection. Moreover, there can be no scientific justification for preferring one researcher’s intuition about language over a body of data gathered from a 170 million-word corpus of authentic text.


Owen has not been the only writer to caution against over-reliance on automated data production. Svartvik (1992) notes that despite the vast advantages of automatic data processing, in many circumstances there is still no replacement for laborious manual work by the researcher. He also warns that corpus data can become abstracted from their context: end-users often have access only to the transcriptions of texts that were originally speech, while the speech itself is not available to them.


A sensible approach to corpus linguistics, then, should utilise everything the machine and the corpus have to offer, but also be guided by intuition where necessary. As Svartvik concludes, ‘the best machine for grinding out general laws out of large collections of facts remains...the human mind’ (Svartvik 1992:12).


5.3.6 Corpora use in this study


If one refers back to Murison-Bowie’s (1996) scale of attitudes towards corpus use mentioned at the beginning of this chapter, one sees that this study veers towards the strong end of the scale. This is not to say, however, that nothing about language can be said without corpora. Yet a study of this nature is essentially correlative: it first compares Business English to general English, and then the Business English of published materials to ‘real’ Business English. This kind of study cannot be done at an intuitive level. Intuition can help in the interpretation of the results, but the primary data must come from authentic sources and must be amenable to empirical analysis. Such empirical and quantitative research is seen as a necessary balance to the purely intuitive teaching materials in Business English discussed in Chapter 3. The positive aspects of a corpus-based study, it is proposed here, far outweigh any possible negative side-effects. This study is nevertheless aware of the criticisms of corpora and therefore does not rely purely on automatically processed data. The study of semantic prosody, to be described in Chapter 9, for example, makes use of corpus data, but can only, at this time, be carried out manually and, to some extent, therefore, intuitively. Thus, to paraphrase Owen (1993:185), the corpora used in this study are the servants, not the masters.
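The division of labour argued for here (machine retrieval followed by human interpretation) can be illustrated with a minimal concordancing sketch. This is a hedged illustration only: the node word, window size and invented mini-text are examples for the purpose, and the concordancing software actually used in this study is of course far more sophisticated. The machine gathers every context of a node word; judging the semantic prosody of those contexts remains a manual, interpretive task.

```python
def concordance(tokens, node, window=4):
    """Return every occurrence of a node word with `window` words of
    context on each side, ready for manual inspection (e.g. for
    semantic prosody)."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}")
    return lines

# Invented mini-text; 'set in' is Louw's (1993) classic example of
# negative semantic prosody (rot, decay, despair tend to accompany it).
text = ("the rot set in when the merger failed "
        "and despair set in among the staff").split()
for line in concordance(text, "set"):
    print(line)
```

The computer can retrieve such lines exhaustively and instantly; only the researcher, reading them, can judge whether the contexts are consistently negative.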




5.4 The next chapter


The previous two chapters have reviewed the key issues involved in Business English and lexis, whilst this chapter has placed the thesis in the methodological framework of British linguistic analysis, correlative register analysis and latterly, corpus linguistics. The next chapter considers the last unanswered questions noted above: those concerning the size, representativeness and balance of corpora. This is done in relation to the two corpora created for this thesis. Other aspects of corpus creation are also investigated, including data processing and the storage of information for later retrieval. This leads to Chapter 7, where the research questions and precise methodology employed in this work are laid out in full.















[1] Stubbs’ work has been published twice, originally in 1993 and then in an extended format in Stubbs (1996). In this section the references given are from both versions.

[2] Stubbs is probably referring here to a corpus size that could say something about the whole language. In this research only Business English is under analysis and the corpus size is considered adequate. See Chapter 6, Section 6.2.1 for more discussion on corpus size.

[3] The one exception to this is 10,000 word extracts from 5 business books. This was unavoidable as the inclusion of whole books would have severely skewed the composition of the corpus.

[4] This contrasts markedly with the qualitative genre analysis approach as espoused by Swales (1990), which is grounded in ‘knowledge of the relevant social purposes within which the text is embedded’ (Yunick 1997:325). Yunick argues, however, that the two approaches should be seen as complementary rather than oppositional. Correlational work, as done by Biber, can identify ‘significant patterns of meaning making which might not emerge from ethnography alone’ (Yunick 1997:326).

[5] The word ‘purpose’ here is meant in a broad fashion to include both a Swales-type definition of communicative purpose  (1990:58) as discussed in Chapter 3, and a more general social/functional  purpose, i.e. language will change according to the reason and circumstances in which it is being used.

[6] See Chapter 4, Section 4.3.2 for a detailed explanation of this.

[7] Lewis (1993) criticises Willis and the COBUILD teaching materials for starting from the word as he suggests that although the COBUILD team made a breakthrough in dictionary making ‘the language teaching materials based on the same criteria were seriously inhibited by a resistance to other types of lexical item’ (1993:92).

[8] See Chapter 4, Section 4.3.2 for more details.

[9] Francis uses the term BC (before the computer) to refer to early corpora-based studies before the advent of computerisation.

[10] See Kennedy (1998) Chapter 2 for a very good history of the use of corpora.

[11] For a further historical account see Francis (1992) and comments in Ma (1993a). See also McEnery & Wilson (1996:2-4).

[12] It should be noted, however, that  Michael West’s GSL made semantic distinctions for word senses and counted the frequencies of them.

[13] See Stubbs’ Principle 2 above for more on this.

[14] Church & Mercer (1994) relate the rise of corpora to the resurgence of empirical methodology that had been popular in the 1950s but had gone out of fashion in the 1960s and 1970s. They suggest that empirical methods (of which corpus linguistics is one) revived due to the rise in the use of computers, the increased availability of data and a greater emphasis on ‘deliverables and evaluation’ (Church & Mercer 1994:21-22).

[15] See Svartvik (1996)  for a history of work on corpora carried out at Lancaster University by Leech.

[16] This latter category was only a prediction of Leech’s, but it has now come true.

[17] Though see Owen (1993) discussed later in this chapter.

[18] A good example of how corpora and computers can be used to discover new aspects of language is the study of lexical landscaping in business meetings (Collins & Scott 1996). In a lexical analysis of British and Portuguese meetings, Collins & Scott were able to establish a lexical ‘landscape’, showing the collocational links between key words gained from the meetings, and how these contributed to the ‘aboutness’ of the meetings by forming into ‘complex units or non-sequential topical nets’ (Collins & Scott 1996:11).

[19] The key word analysis of Business English carried out in this thesis relies in the first stages on pure frequency to compute the key words themselves. Thus, whilst pure frequency plays little part in the lexical analysis, it forms the statistical basis on which the ‘keyness’ of words is established.
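By way of illustration, the frequency comparison underlying ‘keyness’ can be sketched with a generic log-likelihood calculation of the kind implemented in keyword software. This is an assumption-laden sketch, not a reproduction of the exact statistic used in this thesis; the corpus sizes and frequencies below are invented.

```python
import math

def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    """Generic log-likelihood 'keyness' score for one word: how
    unexpected is its frequency in the target corpus, given its
    frequency in a reference corpus?"""
    total = size_target + size_ref
    expected_t = size_target * (freq_target + freq_ref) / total
    expected_r = size_ref * (freq_target + freq_ref) / total
    ll = 0.0
    if freq_target:
        ll += freq_target * math.log(freq_target / expected_t)
    if freq_ref:
        ll += freq_ref * math.log(freq_ref / expected_r)
    return 2 * ll

# A word occurring 120 times per 100,000 in business texts but only
# 30 times per 100,000 in general English scores highly:
print(round(log_likelihood(120, 100_000, 30, 100_000), 2))  # 57.82
```

The higher the score, the more strongly the word is ‘key’ in the target corpus relative to the reference corpus; equal relative frequencies score zero.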

[20] Owen mentions that Firth was suspicious of computers and would not have approved of what has seemingly been done in his name (i.e. Owen is referring to Sinclair’s COBUILD work).