6.1 Introduction


From the beginning of this thesis the need for empirical and corpus-based research in the field of Business English has been stressed. This chapter, therefore, addresses these issues by presenting initial information on the two corpora created for this thesis. This is done in relation to four key factors of corpus creation:


·    Size: The size of both the BEC and the PMC had to be determined.

·    Balance: The necessity of producing corpora that were balanced and representative led to decisions regarding sampling and textual choice.

·    Data collection: Considerable problems of data collection had to be overcome, and access to companies had to be gained.

·    Data preparation and entry: The data had to be prepared for entry into the corpora and then entered and catalogued. This included issues of  transcription and data storage.


Each of these stages necessitated decisions that will be discussed in this chapter both in relation to the literature, and in relation to the actual corpora created. 


6.2 Corpus size


Before any decisions could be made about sampling and representativeness, the overall size of the corpora had to be determined. The question of corpus size has been central to recent corpus development, and there has been an overriding belief amongst many corpus creators that ‘biggest is best’. We have already seen that Halliday and Sinclair proposed a corpus of around 20 million words in 1966. Whilst this was unrealistic at the time, it would certainly be regarded as very modest today. Corpora have thus grown in size across Leech’s (1991) ‘three generations’, from several hundred thousand words in the first to several hundred million in the latest. This view of the need for large corpora is summed up by Sinclair (1991), speaking from a lexicographic point of view: ‘The only guidance I would give is that a corpus should be as large as possible and keep on growing’ (1991:18). Sinclair bases this need for large corpora on the fact that words are unevenly distributed in texts and that most words occur only once.[1] Thus ‘In order to study the behaviour of words in texts, we need to have available quite a large number of occurrences’ (Sinclair 1991:18). Writing the following year, Svartvik (1992) predicted that corpora would only get larger.
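Sinclair’s point about uneven distribution is straightforward to verify on any machine-readable text: in most samples, a large share of word types occur exactly once (hapax legomena). The sketch below is illustrative only; the function name and the sample sentence are invented for the example, and any longer text would serve.

```python
from collections import Counter

def hapax_ratio(text: str) -> float:
    """Return the proportion of word types that occur exactly once."""
    words = text.lower().split()
    freqs = Counter(words)
    hapaxes = sum(1 for count in freqs.values() if count == 1)
    return hapaxes / len(freqs)

# Illustrative sample: 14 word types, of which 12 occur only once.
sample = ("the meeting was called to order and the minutes of the "
          "previous meeting were approved without amendment")
print(f"{hapax_ratio(sample):.2f}")  # prints 0.86
```

Even in this tiny sample, most word types are hapaxes; in running text of any length the effect persists, which is why studying the behaviour of individual words demands large numbers of occurrences.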


Whilst this view of corpora has been the prevailing one, it has not gone unchallenged. Leech (1991), after cataloguing the rise in size of corpora, goes on to say that ‘To focus merely on size, however, is naive’ (Leech 1991:10). He gives four reasons why biggest is not necessarily best:


·    Firstly, a large collection of texts does not necessarily make a corpus; there has to be an element of balance and representativeness to the texts included.

·    Secondly, the massive rise in the size of corpora can largely be explained by the almost exclusive inclusion of written texts, to the exclusion of spoken ones.

·    Thirdly, large corpora present massive problems with copyright - the bigger the corpus, the bigger the problem.

·    Finally, Leech notes a lack of available software to analyse the large corpora adequately.[2]


Leech’s concerns have been echoed more recently with a movement that is more concerned with corpus exploitation than corpus exploration (Ma 1993a, Flowerdew 1998, Tribble 1997, 1998, forthcoming). This movement sees the value of smaller corpora and stresses their pedagogical purpose over their lexicographical potential.

Small corpora, it is held, can be very useful, provided they offer a ‘balanced’ and ‘representative’ picture of a specific area of the language (Murison-Bowie 1993:50). At the time of writing, this recognition of the need for smaller, more specialised corpora is very much on the increase. Ma (1993a) notes that the division of corpora based on size is between corpora used for examining ‘general’ English and those used for examining more specific areas of language use. The usefulness of smaller corpora is seen as pedagogical, as opposed to generally explorative - Ma relates this utility of smaller corpora to ‘groups of learners’ (1993a:17). He also lists a number of ‘pedagogic’ corpora ranging in size from a corpus of philosophy texts at 6,854 words to a corpus of over one million words (Ma 1993a:17). Tribble (forthcoming) also takes a pedagogical perspective. He notes that ‘Corpus linguistics has, to a large extent, developed from an agenda that has been driven by lexicographers, descriptive linguists and the NLP research community’. This has led to a view that the large size of a corpus is all-important. Tribble disagrees with this and, although not rejecting large corpora, suggests that they ‘provide either too much data across too large a spectrum, or too little focused data to be directly helpful to the majority of language teachers and learners’ (Tribble, forthcoming). Tribble uses what he calls ‘exemplar texts’ to exemplify genres, whilst keeping the overall size of the corpus down to manageable levels. He adds (Tribble 1997)[3] ‘If you are involved in language teaching rather than lexicography, single word lists from small selective corpora can be seriously useful’.


Further pedagogical usefulness of small corpora is suggested by Howarth (1998:33-34). Howarth suggests that larger corpora may be good for extracting general patterns of native speaker speech, but believes that smaller corpora are needed for analysing the language of non-native language learners. Data on the averaged performances of many native speaker writers is satisfactory for a representative corpus, but when the same principles of data collection and corpus size are applied to non-natives, it can lead to a loss of detail and ‘one loses opportunities for identifying significant differences among learners’ processing mechanisms and cognitive strategies’ (Howarth 1998:34).

To summarise, corpora that have been used for lexicographical purposes - looking at the whole language - have, perhaps by necessity, always been created to be as large as possible, and this need for large size became the dominant way of thinking. Recently, however, the need for smaller corpora - looking at specific areas of the language - has been recognised, especially in relation to teaching and use in the language classroom, and several smaller corpora have been created for pedagogical purposes. As Kennedy (1998) reminds us, ‘A huge corpus does not necessarily ‘represent’ a language or a variety of a language any better than a smaller corpus’ (Kennedy 1998:68). Researchers should therefore ‘bear in mind that the quality of the data they work with is at least as important [as the size]’ (1998:68). The two corpora created for this thesis clearly fall within the category of smaller, specialised corpora, and the next section discusses aspects of their size.


6.2.1 The size of the Business English Corpus


The final size of the BEC comes to just over one million running words. This size has been arrived at on the basis of three main criteria: pragmatic, historical and pedagogical.


Pragmatic: Larger lexicographical corpora such as the BNC and COBUILD run to hundreds of millions of words. Collection on this scale is not a feasible option for a lone researcher. Additionally, as the BEC consists of almost 50% spoken language, the collection of larger amounts of spoken data would have proved too difficult. Once the decision had been made that a smaller corpus was to be created, the figure of one million words was arrived at as a result of the two remaining criteria.

Historical: The figure of one million seems to be a ‘magic’ number in terms of the size of older corpora. Many influential older corpora - what Leech (1991) calls the ‘first generation’ - were around the one million word mark, or often much smaller. Examples are the SEU[4] at one million words and the Brown Corpus, also at one million words. There was, therefore, a historical reason for the one million word target size of the BEC. In addition to this tradition, smaller, specialist corpora, of which the BEC is one, have often used the one million word mark (or smaller) as a target number of running words. Comparable specialist corpora to the BEC are the Guangzhou Petroleum English Corpus of 411,612 words,[5] the HKUST Computer Science Corpus[6] at one million words and the Århus Corpus of Contract Law,[7] also at one million words. Fang (1993), in describing the creation of the HKUST corpus, specifically refers to the older generation of corpora, giving their size as one of the reasons for his choice of size. Additionally, he adds that ‘one million words represent a reasonably large proportion of the finite subset of the language under study’ (Fang 1993:74). As the BEC is not meant as a general English corpus, and in line with the specialist corpora noted above, one million words was deemed a reasonable sample size for achieving a representative picture of Business English. The remaining reason for the one million word size of the BEC was pedagogical.

Pedagogical: Smaller corpora enable easier access to the data found in them. This in turn leads to easier transferral of results to the classroom. Thus the BEC, as a ‘special English’ corpus in the sense put forward by Ma (1993a), Tribble (1997, 1998, forthcoming) and Flowerdew (1998) above, has a clear pedagogical purpose which further justifies the choice of its size.


6.2.2  The size of the Published Materials Corpus


No initial decisions were made on the overall final size of the PMC in terms of the number of running words to be included. The sampling procedures, to be described later in this chapter, however, set the final number of books to be included at 33. This, in turn, determined the final size of the corpus, which came to 593,294 running words. The number of words in the Business English books varied, for example, from Telephoning in English (Bruce 1987) at 3,437 words to International Business English (Jones & Alexander 1996) at 54,212 words.

The choice of the number of books and the actual books chosen leads automatically to matters of sampling, balance and representativeness. The next sections will discuss these issues firstly with regard to the BEC and then to the PMC. The literature is referred to where appropriate.


6.3  Sampling, representativeness and balance in the BEC


6.3.1 Introduction


As this study is centred on the creation of two corpora, the notions of balance and representativeness of the corpora are of prime importance. In the case of the BEC, to be worthwhile, the language data must be generalisable to the business population as a whole. This importance of creating a representative body of language for analysis is stressed by Renouf (1987), who adds: ‘The first step towards achieving this aim is to define the whole of which the corpus is to be a sample’ (Renouf 1987:2).


Unfortunately, any corpus creator is faced with a ‘chicken and egg’ situation. In this case, the BEC is meant to lexically define what Business English is, but in order to create the corpus, one must first decide what Business English is. Therefore, for pragmatic purposes, some initial definition of business needed to be made. Business English for the purpose of the creation of the BEC was defined as the language used in the pursuit, transaction and discussion of business, trade and commerce. It is found in both written and spoken forms.


As a result of this ‘chicken and egg’ situation, sampling and representativeness are both difficult problems, firstly in corpus linguistics as a whole (Renouf 1987, Oostdijk 1991, Clear 1992, 1997, Kennedy 1998) and, more specifically, in the collection of real life Business English data. Companies are loath to let people in on their proceedings, so the possibilities for the demographic or random sampling normally used within the social sciences have been virtually nil. Clear sampling methods have nonetheless been used in the creation of the BEC, by setting up, before data collection began, well-defined categories of language that could then be filled. The methods used in sampling in the BEC are now discussed in relation to the literature.


6.3.2  Sampling


In any empirical study, the samples of data gathered are, arguably, the most important element - it is on the collected data that the validity of the study rests. The problems of sampling within corpus linguistics, however, have been widely noted in the literature. Clear (1992) gives three main reasons why sampling can be problematic for the corpus linguist:


·    Firstly, there is the problem noted above, that the population from which the sample is to be drawn is poorly defined. This is certainly true in the case of general English, and it is also true of Business English. Any solution as to a definition of Business English must be at least to some extent based on intuition.

·    Secondly, ‘there is no obvious unit of language which is to be sampled and which can be used to define the population’ (Clear 1992:21).

·    Finally, considering the size of any aspect of language, the researcher can never be sure that all instances have been accounted for satisfactorily.


Clear (1997) offers an analogy to explain this situation:


I have a favourite analogy for corpus linguistics: it's like studying the sea. The output of a language like English has much in common with the sea; e.g.

-  both are very very large...

-  and difficult to define precisely,

-  subject to constant flux, currents, influences, never constant,

-  part of everyday human and social reality.


Our corpus building is analogous to collecting bucketfuls of sea water and carrying them back to the lab. It is not physically possible to take measurements and make observations about all the aspects of the sea we are interested in in vivo, so we collect samples to study in vitro.                                                                         (Clear 1997[8])


A corpus, virtually by definition, is therefore biased to a greater or lesser extent. Yet despite the difficulties, sampling is still necessary. Hudson (1997) answers Clear’s message with an analogy of her own:


But you have to agree that taking odd bucketfuls from different parts of the globe on the basis of known facts about local conditions makes laboratory investigations a squidge more informative than they would be if they were based on what could be pumped through a direct pipeline in the Thames estuary.                                        (Hudson 1997[9])



The sampling procedures used for the creation of the BEC will now be set out in relation to the following factors: an exact definition of the population from which the sample has been drawn, extra-linguistic factors, macro-generic specification, and sample size and make-up.


The population


Two main criteria have been used to define the sample population. Firstly, the definition of business noted above has been used to define what Business English is - i.e. the population for the purposes of this research is people who use English in the pursuit, transaction and discussion of business, trade and commerce. An essentially broad definition is used in order to encompass the diversity that is business. Secondly, the population is native speaker only.[10] Native speakers are defined for the purpose of this study as those people born or brought up in the UK or US whose first language is English. It was decided to exclude non-native speakers of Business English from the study at the outset. This is not meant to de-value the language use of non-native speakers, but it was felt that a corpus of native speaker speech could later be utilised in comparative studies of non-native speaker speech in Business English. Data has been gathered from native speakers of Business English, primarily from the UK. Some data has also been included from the United States, forming 21.4% of the corpus.[11] A more detailed break-down of the corpus can be found later in Section 6.3.3.


Extra-linguistic factors in relation to the population


Other factors taken into consideration were gender, the regional origin of the subjects, the nature of the business and the subjects’ position in it. The corpus was thus not set up using random sampling methods, but systematically attempted to give a stratified, representative view of the Business English that is used, primarily in the UK, today. The approach attempted to use the following categorisations:


1. Men/Women: It is a commonly held belief that business is dominated by men. Whilst this may be the case, Stationery Office statistics (1997:74) showed that in the UK, 44% of the workforce is made up of women. Thus, a decision was made to include this ratio (44:56) of female to male speech if possible in the corpus. In practice this did not prove to be feasible, but female speech does make up 21.13% of the corpus.[12]

2. Regions of captured text: In terms of the written portion of the corpus, samples come from all over the UK and, to a much lesser extent, from the US. Precise information on the origins of the authors, however, was impossible to gain. In the spoken section of the corpus, recordings were taped both in the north and the south of England. The geographical location of the recordings, however, belies the fact that subjects from many different regions can be found on the tapes, from Scotland to Yorkshire to London. Thus, whilst every attempt was made to include a broad variety of regions in the corpus, access to companies was so difficult that the only option was to settle for what was on offer. In the section of the corpus that was recorded on tape, 67,876 words (53.76% of the sample) were gathered in the north of England and 58,367 words (46.23% of the sample) in the south. Additionally, four negotiation situations were recorded abroad, but with the main speaker being from the south of England (16,450 words). Exact locations are not mentioned here to protect the privacy of the speakers.


3. Level of respondent in business: An even spread of lower, middle and upper management language was thought desirable for inclusion in the BEC, as the research discussed in Chapter 3 showed the influence of hierarchy and power on the language produced (e.g. Charles 1996, Watson 1997). This division was made difficult, however, by the fact that in meetings, for example, there is typically a mix of all levels. Additionally, when given correspondence by the companies, it was often not possible to know the level of the authors within the company. There is, therefore, no explicit categorisation in the BEC along these lines, but analysis of the recordings made for the corpus, for example, reveals a wide spread of levels taking part.[13]


4. Type of business: Businesses were initially divided into three sectors: service industries, the financial sector, and manufacturing and production industries. This categorisation was considered important, as it has long been noted in the literature that the specialised language of one industry varies dramatically from that of another (Pickett 1986a,b, 1988, 1989). Every attempt was made to spread the sample evenly over these three sectors, and this can be seen in the corpus - each section includes a wide variety of industries. For example, in the Annual Reports section there are three annual reports: from the insurance, retail and telecommunications industries. In the Business Letters section, there are letters from companies involved in selling and manufacturing fire-fighting equipment, book selling and publication, sports goods sale and manufacture, financial consulting, computer software and a Business Link service provided by a Chamber of Commerce. In the Interviews section there are interviews with people involved in hotel management, chemical and paper development, engine building, bulk container equipment sale and manufacture, publishing, car sales, telecommunications, and construction and project management. It was not always possible to get as wide a spread of industries as would have been desired, but the pragmatics of collecting data from companies often restricted choice. A precise break-down of every text in the corpus, including information on business area, is contained in the BEC Database that accompanies this thesis (this is to be found on the CD ROM - see also Section 6.6).


Specification of macro-genres for the samples


Once the total population had been defined, it was necessary to specify further the categories of language to be included as samples in the corpus. For this, typical examples of what constitutes Business English needed to be defined and set up as categories for inclusion in the corpus. These are termed in this thesis ‘macro-genres’. By this term it is recognised that the categories are essentially fuzzy at their boundaries and that several smaller genres can go into the creation of one ‘macro-genre’: e.g. the macro-genre of ‘business letters’ contains many kinds of business letter with a wide variety of communicative intent.


When deciding exactly which macro-genres would be included in the corpus, the literature was referred to first. There seemed to be some agreement on the key activities and documentation used in business (Stuart & Lee 1972, Holden 1993, Johnson 1993, Ellis & Johnson 1994, Dudley-Evans and St. John 1998). An initial attempt to stratify the corpus, therefore, utilised Stuart & Lee’s (1972) research, which showed in a survey of 11,595 respondents that 49% of all business interaction is speaking and listening, with the rest made up of reading (19%), writing (17%), listening and writing (3%), speaking (4%) and listening (8%). In practice this meant a 50-50 split between written and spoken language. This division was adhered to in the creation of the corpus, though the final figure was 56% written - 44% spoken.


Table IX: Ideal corpus content specification

Population balance: MALE 56% - FEMALE 44%
Regions of captured text: NORTH 33% - MIDLANDS 33% - SOUTH 33%
Level of respondent in business: LOWER 33% - MIDDLE 33% - UPPER 33% MANAGEMENT

WRITTEN - ABOUT BUSINESS (200,000 words):
magazines 50,000; newspapers 50,000; books 50,000; journals 50,000

WRITTEN - DOING BUSINESS (325,000 words):
fax 60,000; letter 25,000; email 20,000; brochure 20,000; contract 20,000; report 50,000; minutes 20,000; annual reports 20,000; memos 20,000; general docs 30,000; manual 20,000; sales leaflet 20,000

SPOKEN - ABOUT BUSINESS (200,000 words):
TV programs 100,000; radio programs 100,000

SPOKEN - DOING BUSINESS (330,000 words):
meetings 70,000; talking 1-1 30,000; negotiations 50,000; telephoning 50,000; technical 30,000; presentations 30,000; entertaining 50,000; training situations 20,000
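The word-count targets in the ideal specification can be cross-checked with a short script. The sketch below simply transcribes the per-genre targets from Table IX into dictionaries and sums them; the variable names are invented for the example.

```python
# Word-count targets from the ideal content specification (Table IX).
written_about = {"magazines": 50_000, "newspapers": 50_000,
                 "books": 50_000, "journals": 50_000}
written_doing = {"fax": 60_000, "letter": 25_000, "email": 20_000,
                 "brochure": 20_000, "contract": 20_000, "report": 50_000,
                 "minutes": 20_000, "annual reports": 20_000, "memos": 20_000,
                 "general docs": 30_000, "manual": 20_000, "sales leaflet": 20_000}
spoken_about = {"TV programs": 100_000, "radio programs": 100_000}
spoken_doing = {"meetings": 70_000, "talking 1-1": 30_000,
                "negotiations": 50_000, "telephoning": 50_000,
                "technical": 30_000, "presentations": 30_000,
                "entertaining": 50_000, "training situations": 20_000}

written = sum(written_about.values()) + sum(written_doing.values())
spoken = sum(spoken_about.values()) + sum(spoken_doing.values())
total = written + spoken
print(written, spoken, total)            # 525000 530000 1055000
print(f"written share: {written / total:.1%}")
```

The targets sum to just over one million running words, with an almost exact 50-50 written-spoken split, consistent with the Stuart & Lee figures discussed above.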


Whilst useful, Stuart & Lee’s work is now dated and does not include more recent methods of business communication such as emails and faxes. For more information, the surveys of Louhiala-Salminen (1996) and Barbara & Scott (1996) were also consulted. Louhiala-Salminen’s work pointed specifically towards the centrality of faxes in the modern office; faxes would thus need to be well represented in the corpus. On the basis of the literature, an initial categorisation was made for an ‘ideal’ content specification. This is shown in the ‘ideal corpus content’ in Table IX above.


All the major examples of business documentation found in the literature were included, along with a provision for ‘general business documents’: miscellaneous documents, for example forms or Bills of Lading, that are essentially the same in format, so that only one example of each needed to be included. All the major activities of business found in the literature, for example meetings, negotiations and telephoning, were also included in the plan for the corpus. The final corpus contained most of these categories, but some compromises were necessary owing to the pragmatic difficulties of obtaining data from companies (these differences are discussed in full in Ideal vs Actual Content in Appendix 12 in Vol. II).


Sample size and make-up


Table IX above shows the ideal specification of the corpus as defined by the discussion so far. A further factor that needs to be mentioned at this stage is sample size. It was noted in Chapter 5 that early corpora used sample sizes of around 2,000 words randomly taken from carefully selected texts. The disadvantages of this method, as opposed to taking whole texts, were also mentioned. Yet even when using whole texts, decisions had to be made on how many whole texts should be included to be representative of a given macro-genre. Further, there are large variations in document size, especially within the written genres: a fax may be forty words long, whilst a report may be several thousand. Setting a fixed number of documents to represent a macro-genre was therefore not feasible; instead, a total number of running words was pre-determined for each macro-genre, and data was gathered for each until that number was reached. Decisions on the number of words that could be representative of each macro-genre, however, were difficult to make, and there is some disagreement in the literature.


Some writers, notably Sinclair (1991), whose predilection for large corpora has already been noted, suggest that smaller samples of language in a corpus cannot be representative of their genre: ‘If a million words is hazarded as a reasonable sample of one state of a language, then the sub-categories necessary to balance the sample are not in themselves reasonable because they are too brief’ (Sinclair 1991:24). Sinclair is here referring to what he termed sample corpora of general English, which used random selection of extracts within pre-set genres (written only). It can be argued, therefore, that Sinclair’s view does not hold for specialised corpora that study only a specific area of the language: in this case Business English. Other writers have criticised the small sample size of the early corpora but have suggested that an increase to around 20,000 words would provide a sample large enough to be representative of a genre. de Haan (1992) argues that ‘Experience with samples of 20,000 words has shown that on the whole these are sufficiently large to yield statistically reliable results on frequency and distribution’ (de Haan 1992:3). Oostdijk (1991) and Kennedy (1998) agree. Oostdijk says that ‘A sample size of 20,000 words would yield samples that are large enough to be representative of a given variety’ (Oostdijk 1991:50).


The sample sizes set for the BEC, therefore, used a minimum cut-off point of 20,000 words for each macro-genre. It is recognised that this is not a perfect answer to the sampling problem,[14] but it should be stressed that this was regarded as a base point, and several examples of genres were planned to be considerably bigger. This was for two reasons:


Firstly, 20,000 words’ worth of faxes could produce a broad range of fax types and styles, owing to the fact that faxes are usually relatively short documents. Contracts and reports, on the other hand, are by their very nature much longer documents: a single example could easily run to 20,000 words and so not be in any way representative of the genre. Thus certain sections had to be bigger if the whole-text policy was to be followed.


The second reason for some categories in the corpus being larger than others was that an attempt was made to weight the corpus in favour of those activities or documents that needs analysis research has shown business people deal with more often on a day-to-day basis - Louhiala-Salminen (1996), for example, notes the importance of faxes, and Nelson (1997) found that students thought business correspondence in general to be important.
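The quota-filling procedure described above can be sketched as a simple greedy loop: complete texts are added to a macro-genre until its running-word target is reached, and the last text may overshoot the target, since whole texts are never truncated. The function name and the document figures below are hypothetical illustrations, not the actual collection records.

```python
def fill_quota(texts, target_words):
    """Greedily select whole texts until the running-word target is met.

    texts: list of (name, word_count) tuples, in collection order.
    Returns the selected texts and their total word count; the last
    text may overshoot the target, as whole texts are never truncated.
    """
    selected, total = [], 0
    for name, words in texts:
        if total >= target_words:
            break
        selected.append((name, words))
        total += words
    return selected, total

# Hypothetical fax collection: many short documents fill a 20,000-word
# quota with a broad range of texts.
faxes = [(f"fax_{i:03d}", 250) for i in range(100)]
chosen, count = fill_quota(faxes, 20_000)
print(len(chosen), count)        # 80 20000

# A single long report, by contrast, can exhaust the quota on its own.
reports = [("report_A", 22_000), ("report_B", 18_000)]
r_chosen, r_count = fill_quota(reports, 20_000)
print(len(r_chosen), r_count)    # 1 22000
```

The contrast between the two runs reflects the point made above: short-document genres yield many texts within a 20,000-word section, whereas long-document genres needed larger sections if they were to contain more than one or two whole texts.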


The final result, shown below in Table X (p.252), is a compromise of these factors and real-life constraints in collection. Whilst it was fairly easy to get annual reports, for example, as they are in the public domain, the gathering of several hundred faxes, letters and emails presented problems. These problems are discussed in more detail below with regard to the concepts of balance and representativeness.


6.3.3 Balance and representativeness in the BEC


The discussions on size and sampling above have necessarily touched on questions of representativeness and balance. This section, therefore, is brief and simply states the position taken. It is taken as true that a perfectly representative corpus is impossible to define and create (Clear 1992, 1997, Kennedy 1998). Lewis (personal communication 1999) also pointed out that when talking about the representativeness of a corpus, it is necessary to ask the question ‘Representative of what?’ In reality, there are so many variables that Lewis is of the opinion that the notion of ‘representativeness’ in its entirety is a ‘non-concept’. Kennedy concurs, saying that ‘it is not easy to be confident that a sample of texts can be thoroughly representative of all possible genres or even of a particular genre or subject field or topic’ (Kennedy 1998:62). Clear (1997) summarises the problem succinctly as he continues the discussion in the Corpora discussion group with Jean Hudson noted above:


I wonder a great deal about this balance and representativeness issue. How did it rise to such prominence as a litmus test for corpora? The UK dictionary publishers are very much to blame. When the British National Corpus was put together by OUP, Longman, and Chambers there was an intensification of the ‘balance’ issue. In 1991, at one of the OED/Waterloo conferences sponsored by OUP, a debate was organized with the motion ‘A corpus should consist of a balanced and representative selection of texts’. Randolph Quirk and Geoff Leech proposed the motion and John Sinclair and Willem Meijs opposed it. The motion was defeated. Jean, like many users of the Birmingham/Cobuild Bank of English corpus, wants to find out how to ensure that  a corpus will be balanced, and she hopes to compare word frequencies, ranks, correlations, best-fits, smoothed approximations, chi-squares, log-likelihood,... but it is a chimera which she is pursuing.                                                    (Clear 1997)[15]                                                                                          


Any attempt at corpus creation is therefore a compromise between the hoped for and the achievable. Reference to previous needs analysis studies (e.g. Stuart & Lee 1972, Louhiala-Salminen 1996), however, offers the possibility of stratifying and balancing the corpus in favour of those activities business people actually participate in. Two important decisions, therefore, were made in setting the balance for the BEC:


1. Firstly, there was the decision to include spoken language in the corpus. It was considered that no corpus that purports to represent Business English as a whole could omit spoken language. As spoken language is probably used far more than written language in the world at large, it could be argued that spoken language should make up the majority of the corpus. For a single researcher this was not possible. Every tape recorded at the different business events had to be transcribed ‘by hand’, a time-consuming and laborious task. Thus, a compromise was struck between written language, which was relatively easy to collect, and spoken language, which was more difficult. Spoken language represents 44% of the BEC.


2. Secondly, it was considered important to distinguish between language used for actually doing business, and that used for talking about business. This corresponds to Pickett’s (1988) idea of knowing and acting in Business English, as discussed in detail in Chapter 3. Pickett suggested that there is a difference between the language needed by pre- and post-experience learners: essentially divided into language needed for knowing about business in general, and language needed for actually doing it. Pickett drew a table to show this division and proposed that ‘we ought to be able to supply words and expressions [in each box] that would not just as easily go in another’ (Pickett 1988:91). Whilst there are certain problems in defining these concepts (for example, one can talk ‘about’ business whilst in the process of ‘doing’ business), it is proposed, in line with Pickett, that the two kinds of language are essentially different. Previous Business English materials created from corpora have concentrated on language used to talk about[16] business and have also concentrated only on written examples of text (e.g. Mascull 1993). This shortcoming was deliberately avoided in the BEC.


The division in the corpus is 59% language for ‘doing’ business and 41% for ‘talking about’ business. It had initially been hoped to keep the ‘doing’ business section at 80%, but the availability of data forced a lower percentage of inclusion. A break-down of the BEC is given below, divided into four sections: the written section, about and doing business; and the spoken section, about and doing business.












Written: talking about business (196,607 words)
- 5 extracts from different books (approx. 10,000 words each)[17]
- 121 articles
- 52 articles

Written: doing business (379,096 words)
- 3 annual reports, 29 business press releases (29,602 words)
- 13 contracts/agreements
- 114 faxes
- 94 letters
- 17 reports
- 13 company brochures
- 202 emails
- 87 job advertisements
- 5 manuals
- 47 memos
- 15 sets of minutes
- 19 product brochures
- 21 quotations
- OHT, job description & agendas

Spoken: talking about business (219,877 words)
- 24 interviews
- 72 broadcasts

Spoken: doing business
- 6 interviews
- meetings (126,243 words)
- sales and marketing presentation of new products (16,450 words)
- 4 negotiation sessions (30,414 words)
- 89 phone conversations
- 5 speeches (17,867 words)
- 1 session of technical training






The key elements of the content of the BEC are summarised in the following charts:







Male: 544,926 words

Female: 216,261 words

Unknown: 262,095 words



Fig. 22 Gender division in the BEC by percentage of words









UK: 804,002 words

USA: 219,019 words




Fig. 23 UK/US language in the BEC shown by percentage of words








Doing: 606,537 words                       

About: 416,484 words

Spoken: 447,318 words         

Written: 575,703 words


Fig. 24 The Spoken/Written and Doing/About divisions in the BEC shown by percentage
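The reported percentages follow directly from these word totals; a quick arithmetic check (figures taken from Fig. 24 above):

```python
# Word totals as given in Fig. 24 (BEC, 1,023,021 words in all)
doing, about = 606_537, 416_484
spoken, written = 447_318, 575_703

total = doing + about
assert total == spoken + written   # both divisions partition the same corpus

print(round(100 * doing / total))   # 59 -- 'doing' business
print(round(100 * about / total))   # 41 -- 'talking about' business
print(round(100 * spoken / total))  # 44 -- spoken language
```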



It can be seen from this break-down of the actual BEC that it differs to some extent from the original ideal specification. These differences are discussed in detail in the Ideal vs Actual section in Appendix 12, Vol. II, along with a detailed break-down of sections of the corpus. The next section will look at the make-up of the PMC with respect to its sampling, balance and representativeness.


6.4 Sampling, balance and representativeness in the PMC


For the purpose of the study, the initial population for the PMC was defined as Business English (UK) published materials from the period 1986[19]-1996. Gaining a representative sample of this population presented different problems from those of the BEC. A representative sample of Business English materials was needed in order to analyse exactly what the language of Business English teaching materials is. Renouf (1987:17) described how COBUILD created their ‘TEFL’ corpus of EFL teaching material using a questionnaire, distributed via the British Council, asking teachers which books were most used. For a lone researcher with no connections of this nature, however, an alternative strategy had to be devised. The problem was solved in the following way:


1. In 1997, seven major distributors of EFL materials were contacted by phone and asked to provide a list of their best-selling Business English titles of 1996.

2. Five of the seven responded. Actual sales figures were not available, but a rank order of popularity was obtained from each bookshop.

3. Once the lists were collected, the books were ranked according to their popularity at each bookshop, and these ranks were averaged over the five shops to give an overall order of popularity covering all five bookshops.

4. A total of 38 books were obtained for the final list.

5. Of these, five were rejected from the final list included in the corpus. The books were rejected for three main reasons:

a)  Firstly, all the grammar and reference books on the list were excluded as it was felt that they did not represent teaching material as such, and were composed mostly of isolated sentences and exercises.

b) The second reason for rejection was that some books, for example, Build Your Business Vocabulary, which was overall the third most popular Business English book, could not be included for formatting reasons. All the exercises were gap-fill for the students to complete - it was not feasible to start filling all the blanks and then scanning them in afterwards.

c) The third reason for rejection was also related to formatting. One book, Writing for Business, was excluded because it was formatted in such a way that the scanner could not read the text and the whole book would have had to be inputted manually. The final list of books used in the PMC is shown below in order of their overall popularity in the five bookshops.[20]
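The rank-averaging in steps 2 and 3 can be sketched as follows (the shop lists and titles here are invented placeholders, not the actual bookshop data):

```python
# Each bookshop supplies its own rank-ordered best-seller list
# (1 = most popular). Titles below are placeholders, not the real data.
shop_rankings = [
    {"Title A": 1, "Title B": 2, "Title C": 3},
    {"Title A": 2, "Title B": 1, "Title C": 3},
    {"Title A": 1, "Title C": 2, "Title B": 3},
    {"Title B": 1, "Title A": 2, "Title C": 3},
    {"Title A": 1, "Title B": 3, "Title C": 2},
]

# Average each title's rank across the shops that list it,
# then sort ascending to obtain the overall order of popularity.
titles = {t for shop in shop_rankings for t in shop}
avg_rank = {
    t: sum(shop[t] for shop in shop_rankings if t in shop)
       / sum(1 for shop in shop_rankings if t in shop)
    for t in titles
}
overall = sorted(avg_rank, key=avg_rank.get)
print(overall)  # ['Title A', 'Title B', 'Title C']
```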






1. Business Opportunities
2. Business Objectives
3. New International Business English
4. Business Class
5. Insights into Business
6. Business Matters
7. Handbook of Commercial Correspondence
8. Financial English
9. LBES: Giving Presentations
10. Language of Meetings
11. LBES: Presenting Facts and Figures
12. LBES: Meetings & Discussions
13. In at the Deep End
14. Business Idioms International
15. LBES: Negotiating
16. Business Challenges
17. Business Games
18. LBES: Socializing
19. Business Basics
20. Developing Business Contacts
21. LBES: Telephoning
22. Keywords in Business
23. Business English Pairwork
24. Survival English
25. BBC Business English
26. Presenting in English
27. Company to Company
28. Business Listening and Speaking
29. Early Business Contacts
30. Written English for Business 2
31. English for Int Banking and Finance
32. Business English
33. BMES: Personnel




Once the books had been gathered, a decision had to be made on what language would be used for inclusion in the corpus. Books are full of rubrics, exercises and word lists as well as longer texts and examples of various kinds of Business English. There were few guidelines to go by, so contact was made with Professor Magnus Ljung of Stockholm University, who had created a TEFL corpus for his (1990) study. Professor Ljung suggested that only pure texts be included, as he had used this method in the creation of his corpus: all rubrics had been stripped away, as well as all exercises. It was initially decided to follow this advice. However, on closer examination of the Business English books in question, it was soon discovered that many books were made up of little other than exercises, and very few clearly discernible ‘texts’, in the sense of separate reading/spoken texts, were present in the books. Thus, a compromise was reached. All rubrics were excluded. All exercises that were in some way ‘gapped’, i.e. that contained gaps which the student had to complete, were also excluded. Only full examples of language that purported to be examples of Business English, both written and spoken, were included. This could be in the form of exercises, examples of correspondence, and longer and shorter texts. Single-word word lists were excluded. If the same phrase was repeated several times as part of an exercise, only one occurrence of the phrase was included in the corpus. The decision as to what constitutes Business English and what does not, noted above, was left to the researcher, thus introducing an element of intuition. This was, however, unavoidable.
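The inclusion rules above can be summarised as a filter over candidate segments; a minimal sketch, assuming segments have already been extracted and labelled (the labels, the gap marker ‘____’ and the example phrases are assumptions for illustration, not the books’ actual mark-up):

```python
import re

def include_segment(text: str, kind: str) -> bool:
    """Apply the PMC inclusion rules to one candidate segment.

    kind: 'rubric', 'wordlist', 'exercise' or 'text' -- labels assumed
    here for illustration; in practice the decisions were made by hand.
    """
    if kind == "rubric":        # all rubrics excluded
        return False
    if kind == "wordlist":      # single-word word lists excluded
        return False
    if "____" in text:          # 'gapped' exercises excluded
        return False
    return True

def dedupe_repeats(segments):
    """Keep only one occurrence of a phrase repeated within an exercise."""
    seen = set()
    for s in segments:
        key = re.sub(r"\s+", " ", s).strip().lower()
        if key not in seen:
            seen.add(key)
            yield s

kept = [s for s in dedupe_repeats(
            ["Could you confirm the order?", "Could you  confirm the order?"])
        if include_segment(s, "text")]
print(len(kept))  # 1
```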


The main factor, therefore, behind the content of the PMC was the popularity of the books - it was considered of prime importance to have a corpus that represented the Business English books that teachers and students actually use. Once the sample had been gained in the manner described above, the books included could be broken down in terms of type of book, gender of author, and focus on one or more of the ‘four skills’. These divisions are shown in the diagrams on the next two pages.









Fig. 25 The PMC divided into 70% resource books (23 books) and 30% course books (10 books)



The resource books are further divided into the following topic categories:












Fig. 26  Resource books in the PMC



The balance of male/female writers is shown in Fig. 27. Here, female also includes books where at least one author was female:










Fig. 27  Gender distribution of authors of books in the PMC



Finally, the break-down of books specifically devoted to writing and speaking skills is shown in contrast to those that cover all skills. As can be seen, most books (denoted ‘general’ in the chart) encompass both skills.










Fig. 28 Books devoted to speaking, writing or general skills in the PMC



Once specifications had been made as to the content of the corpora, the data had to be collected and entered into the computer. This was the most difficult stage of the research and, in all, took just over three years to complete.


6.5 Data collection and entry


6.5.1  Data collection for the BEC


It has been well documented in the literature that the collection of business language from authentic sources is extremely difficult (Ellis & Johnson 1994, Dudley-Evans & St John 1998, Williams 1998). Thus, the collection of such a large amount of data from actual companies was a daunting task. Three alleviating factors played a part in helping to solve this problem. Firstly, there was a large amount of information publicly available and easily accessible through the Internet. Secondly, personal contacts in the business world were used in order to gain access to companies that would otherwise have remained inaccessible. These first two factors corresponded with the kind of information gathered: in the former case, data in the talking about business category was collected, e.g. from books, journals, newspapers, radio and TV programmes; in the latter, private data in the doing business category was collected. Finally, after all personal contacts had been exhausted, a large amount of data was still needed. For this reason, the Chamber of Commerce in a large British city[21] was approached for assistance in gaining access to companies. This proved a very successful strategy - three companies then opened their doors either fully or partially in order for data to be collected. Below is a brief review of the data collection process.

Publicly available data


Data here was gathered from a variety of sources - newspapers, journals, magazines and a number of sites on the Internet. As noted above, this largely constitutes business language that is publicly available and is thus concerned with talking about business rather than the actual process of doing business. A full break-down is available in Appendix 12 in Vol. II in the Ideal vs Actual section, and in the BEC Database on the CD ROM.

Private data


‘Private’ data here refers to data that is not in the public domain. Private data was gained primarily from two sources - companies that were accessed using personal contacts and companies accessed via the Chamber of Commerce that agreed to help.


Use of personal contacts: This category was of extreme importance - in order to get access to confidential data, a certain level of trust has to exist between the researcher and the business people involved. Thus, if one is personally known to the subjects beforehand, the chances of actually getting data are greatly increased. However, even with people one already knows, it was not always easy to persuade them to help. There had to be a degree of polite ruthlessness - depending on their position within a given company, it was sometimes easier for them to refuse to help than to assist in the data gathering process. Thus a certain amount of persistence was needed to get into even these companies. Below is a break-down of the type of companies accessed through personal acquaintance and the kind of data gained from them.




Company’s area of business | Data gained
1. Sportswear sales, manufacture and distribution | tape recorded meeting, correspondence, tape recorded phone calls, interviews
2. Publishing | tape recorded meetings, tape recorded phone calls, negotiations
3. Financial Consultants[22] | tape recorded meetings, interviews, correspondence
4. Software |
5. Ship-board cargo handling |
6. Chemicals | correspondence, interview


Chamber of Commerce: Previous contact had been made with the Chamber on a teaching matter, so it was deemed a good starting point. An initial meeting was set up between a representative of the International Division of the Chamber and the author and his supervisor. A summary of the corpus project, prepared to give information to the Chamber and also to possible companies that might allow access, was handed over to the Chamber representative. The representative promised to present the project at the next gathering of their International Committee. This was done, and five companies gave permission to approach them. Of the five approached, three gave either limited or unlimited access to data. Two companies failed to give access: one in the end refused,[23] and the other agreed but was unable to give data owing to the busy timetable of the people involved - they could simply never be pinned down to a meeting. Additionally, the Chamber, as an example of a service industry, allowed access to one of their Business Link meetings and also gave samples of their correspondence. A break-down of all data gained via the Chamber is shown below.




Company’s area of business | Data gained
1. Project management / construction | tape recorded meetings, correspondence, tape recorded interviews
2. Transportation |
3. Fire equipment manufacturers and marketing |
4. Railway equipment manufacturers | unable to set meeting
5. Rubber, plastic & cable manufacturers | refused entry
6. Chamber of Commerce Business Link | tape recorded meeting, correspondence


6.5.2 Data collection for the PMC


In contrast to the data collection process for the BEC, the collection of data for the PMC presented no difficulties. The 33 books included in the PMC were simply gathered together for the next stage - entry of the data into the computer.


6.5.3 Methods of data entry in the BEC


Sinclair (1991:14) identified three main methods of preparing data for entry into a corpus: adaptation of data already in electronic format, scanning and keyboarding. These three categories are now used to describe how data was prepared for and entered into the BEC.

Adaptation of material already in electronic form


Apart from a limited amount of data donated by one informant, material in electronic format was not readily available, other than from the World Wide Web (WWW). Therefore, the WWW was used and appropriate data was downloaded (344,029 words in total) and stripped of its HTML formatting by saving it as a plain text file. All embedded computer instructional text, e.g. phrases such as [image] or [Home] which abound in this environment, was then removed by hand. The data was then catalogued and entered into the corpus in the correct macro-generic category.

Conversion by optical scanning


The scanning program Cuneiform 3.0 was used to scan the text. After scanning, this program created a rich text file (.rtf). The rtf was then stripped of all excess formatting by opening it in Wordpad 1.0 for Windows and saving it as a text file. This very often had the effect of upsetting the format of the text, and a great deal of work was needed to get the text into a presentable format.[24] It should also be noted that although scanning often gives very good results, if the quality of the original text had degenerated in any way, for example, if it was a photocopy, then accuracy could drop to 40 or 50% (see Renouf 1987:9 for more on this). Thus, every text had to be very carefully checked to make sure that the computer text matched the original. This, obviously, was very time-consuming - 373,011 words were scanned into the BEC.

Conversion by keyboarding


The spoken texts recorded for the BEC were recorded using a Sony TCM-453V voice activated tape recorder with an external Sony ECM-F8 desk-top microphone. The telephone conversations were recorded with the same tape machine but with a stick-on ‘pick-up’ microphone that was attached to the telephone receiver. The total cost of this equipment was under £50 and so does not represent top-of-the-range recording equipment. It was, however, perfectly adequate for most purposes.


This was the most laborious of the three methods. All the tape recorded meetings, negotiations, interviews and telephone conversations were transcribed by hand by the author. The total amount of data transcribed in this way was 250,589 words. An average of two to three hours of work was spent on every 1,000-word block of text.[25] Thus, a minimum of 500 hours’ work was spent on transcribing this section of the corpus. Data preparation was helped by the use of IBM ViaVoice Gold speech-to-text recognition technology. Each recorded tape was played to the author via headphones. After listening to a short section of the tape, the author re-spoke the same section into a microphone attached to the computer, and the ViaVoice Gold program automatically turned this speech into text. Once the text had been fully entered for each tape, it was listened to again and any mistakes were corrected manually. A third and final check of the tape-to-text accuracy was then made, and mistakes again corrected manually. Thus, every tape recording was listened to at least three times. It should be stressed, however, that speech-to-text accuracy is extremely variable - sometimes it could be 95%, but at other times, especially when specialist words were being constantly used in meetings, for example, accuracy dropped sharply and the text had to be manually corrected.


Additional problems were encountered with the quality of the tapes. Most were of reasonable quality,[26] but several were of such poor quality that they took a very long time to transcribe - one 90-minute meeting took 20 hours. A break-down of the data in terms of mode of entry can be seen below.





Mode of entry | Number of words
Electronic format (WWW and on disk) | 399,421
Conversion by optical scanning | 373,011
Conversion by keyboarding | 250,589
Total | 1,023,021





6.5.4  Data entry in the PMC


The books, once collected, were scanned into the computer one by one over a period of seven months. In all, 593,294 words were scanned in this way. The text files created by the scanner were then stored on disk for later analysis with WordSmith 3. This process at times presented major recognition problems for the scanner. Most Business English books nowadays are very colourful, and text is interspersed with pictures, graphs and diagrams. In many instances the scanner was unable to cope with this formatting, and every scanned text had to be manually corrected. Additionally, many books included ‘hand-written’ notes or text that the scanner was unable to read. These, again, were entered by keyboarding.


The discussion above has mentioned the amount of text that was transcribed by hand for both corpora, but especially the BEC. Transcription, however, is not a simple matter, and many decisions had to be made on a transcription scheme that would give an accurate representation of the recorded events. The next section, therefore, both briefly refers to the literature on transcription and shows how the BEC was transcribed.


6.5.5 Transcription


The data found in the two corpora created for this thesis have to a large extent followed the ‘clean-text policy’ of Sinclair (1991), who proposed that ‘The safest policy is to keep the text as it is, unprocessed and clean of other codes’ (Sinclair 1991:21). Sinclair’s reasons for this are twofold: different researchers impose different priorities on corpus data, and a lack of standardisation in analytical measures would create problems for later research of a different nature. Similarly, there is a lack of agreement on basic linguistic features such as words and morphological division. For the BEC, two versions of the corpus have been created. Firstly, a ‘clean-text’ version was created, in which the corpus consists purely of the texts themselves with no annotation at all.[27] A second version of the BEC was then created which was Part-of-Speech (POS) tagged using an automatic tagger - Autasys (Fang 1998) - which assigns a grammatical tag to each word. The LOB tag-set was used for POS assignation. An example is shown below:
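A constructed illustration of the word_TAG output format, using common LOB tags, might look like the following (the sentence and its tag assignments are illustrative only, not taken from the BEC):

```
The_ATI company_NN has_HVZ increased_VBN its_PP$ prices_NNS ._.
```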



The tagging was done in order to assist in the process of colligational analysis and for the convenience of later researchers, but as it is done automatically - i.e. the assignation of grammatical tags is done by the computer program - it is not fully accurate. As the main focus of this study is language found in the BEC, the PMC was not tagged.


The texts found in the BEC come from many different sources and were entered into the corpus by the three methods noted above: scanning, keyboarding and the utilisation of texts already in electronic format. Data from scanned text and from the WWW needed no transcription scheme as such, and was used on an ‘as-is’ basis. However, there were certain problems with ‘dirty data’ (Blackwell 1993). The scanned data needed to be checked very carefully for spelling mistakes. In addition, as Blackwell notes, words that had been hyphenated in the original text were joined by the scanner, but the hyphens were left in (e.g. bus-iness). These had to be spotted and corrected.

Spoken language transcription


The problems of transcribing spoken language for corpora have been well documented in the literature, for example Blackwell (1993), Crowdy (1994), Edwards (1995), Nelson (1997), Carter & McCarthy (1997) and Aston & Burnard (1998). Indeed, Edwards points directly to the fundamental problems of transcription when she says that ‘Far from being an objective and exhaustive mirroring of events of an interaction, a transcription is fundamentally selective and interpretive’ (Edwards 1995:19). The transcriber is thus faced with many choices and problems to solve during transcription. If these choices are wrongly made, ‘it can hinder detection of patterns of interest, and give rise to directly misleading impressions’ (Edwards 1995:19). The major problems of transcription are also noted by Crowdy (1994), drawing on his work transcribing the British National Corpus. Crowdy’s taxonomy of problems is particularly useful and is used here, along with other problems encountered. In each case the problem is noted and followed by the solution arrived at for the BEC.


1. Defining speaker turns: Each speaker’s turn is separated by a line and the participants were given labels, e.g. NS1, NS2 to represent each native speaker (NS).

2. Defining sentences: It is very difficult to define where one ‘sentence’ ends and another begins in spoken language. In the BEC this was determined partly by intonation - if the intonation dipped and so ‘sounded’ like the end of a sentence, the sentence was concluded. A longer pause followed by a re-start also often marked the boundary between one ‘sentence’ and another. Very often, however, it was only the researcher’s intuition that decided where breaks would come.

3. Overlapping speech: Crowdy used square brackets to show overlapping speech as do Carter & McCarthy (1997). In the BEC overlapping speech was dealt with by placing the words that overlapped at the same point in the text:


1: I said she would be happy ...

2:                                 Happy, yes ....


This feature was not recorded 100 per cent of the time, however. As the focus of the thesis is purely lexical and not discoursal, it was considered more important at this stage to only record the actual words used. Future work with the BEC should rectify this shortcoming.

4. Interruptions: If a person interrupted another it was presented as follows:

NS1:  I wanted to tell her ....

NS2: ... the bad news ....

NS1: ...the bad news and she didn’t take it very well ...


Run-ons of one sentence over an interruption were shown by lack of capitalisation of the run-on sentence and the use of dots.

5. Punctuation: Commas were used to show a pause. Dots were used to show longer pauses. Full stops mark the end of a ‘sentence’.

6. Paralinguistic features: These were not noted in the corpus.

7. Non-verbal sounds: Phatic utterances to denote pauses or hesitation have been reduced to er or erm.

8. Inaudible text: In some cases it was impossible to detect exactly what had been said. In these cases asterisks were used to denote missing text: ***.

9. Anonymity: To preserve anonymity the following codes were used:

Company: Companyname

Full name: Personname

First name: Firstname

Company address: Companyaddress

Place name: Placename

10. Variable word forms: Nelson (1997) notes the problems of such forms as alright and all right in transcription. Similarly, hyphenated, separated or joined versions of the same phrase can affect final frequency, e.g. life-style, life style, lifestyle. The governing principle in the BEC was to keep words separate, e.g. all right. UK English was used as standard throughout, except, of course, in the US texts, which were left in their original spelling.
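The anonymity codes in point 9 amount to systematic substitution; a minimal sketch (the name list is an invented placeholder - in practice the real names occurring in each transcript were replaced; the codes themselves are those listed above):

```python
import re

# Placeholder name list -- in practice these were the real names
# occurring in each transcript. Order matters: full names are
# replaced before first names.
ANONYMISE = {
    r"\bAcme Ltd\b": "Companyname",
    r"\bJohn Smith\b": "Personname",
    r"\bJohn\b": "Firstname",
    r"\b12 High Street\b": "Companyaddress",
    r"\bNorthtown\b": "Placename",
}

def anonymise(utterance: str) -> str:
    """Replace identifying details with the BEC anonymity codes."""
    for pattern, code in ANONYMISE.items():
        utterance = re.sub(pattern, code, utterance)
    return utterance

print(anonymise("NS1: John Smith of Acme Ltd called from Northtown ..."))
# NS1: Personname of Companyname called from Placename ...
```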


Once collected and entered into the computer, the data had to be stored in a way that would facilitate easy retrieval.





6.6 Data storage and retrieval


Storage and easy retrieval of data is of central importance in the creation of any corpus that will be used by more than one researcher. For this purpose a database was set up using Filemaker Pro 2.1. The fields of the database allow retrieval of data from the 1,102 texts that form the BEC according to the following criteria:


1. File name: The actual name of each individual file.

2. URL: (where applicable) This allows the researcher to access the original WWW site where the data was downloaded from.

3. Text topic: What the text was about.

4. Text title: Where there was an actual title given in the text, this was noted here.

5. Text source: Where the text came from, e.g. WWW, tape, book, magazine.

6. Text length: The number of running words in a text.[28]

7. Text nationality: The origin of the text, e.g. UK or US.

8. Gender: The gender of speakers or writers of the texts if known.

9. Text type: e.g. spoken, written.

10. Date of text origin: When the text was originally written/spoken.

11. Corpus sector: Which part of the corpus a text belongs to, e.g. W:D: Annual reports, denotes annual reports to be found in the (W) written (D) doing business section of the BEC.
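The eleven fields can be sketched as a simple record structure (a sketch only: the field types and sample values are assumptions; the actual catalogue was a Filemaker Pro 2.1 database, not code):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class BECRecord:
    """One catalogue entry per text in the BEC (1,102 in all)."""
    file_name: str
    url: Optional[str]          # only for WWW-sourced texts
    text_topic: str
    text_title: Optional[str]
    text_source: str            # e.g. 'WWW', 'tape', 'book', 'magazine'
    text_length: int            # running words
    text_nationality: str       # e.g. 'UK', 'US'
    gender: Optional[str]
    text_type: str              # 'spoken' or 'written'
    date_of_origin: str
    corpus_sector: str          # e.g. 'W:D: Annual reports'

# Illustrative entry (values invented, not from the actual database)
record = BECRecord("ar_001.txt", None, "annual results", None, "book",
                   9_500, "UK", None, "written", "1996",
                   "W:D: Annual reports")
print(record.corpus_sector)  # W:D: Annual reports
```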


An example of the database can be seen in the picture on the next page.



Fig. 29  An example from the BEC database


There are 1,102 entries in the database - one entry to match each text included in the BEC.


In the background of the data collection process were three factors that now need to be mentioned: how the confidentiality of the respondents was maintained, and the questions of copyright and ethics.


6.7 Confidentiality, copyright and ethics


Confidentiality: It was obvious from the outset that secrecy would be a major concern for any company with a Business English language investigator taking away large amounts of correspondence or tape recorded meetings. In order to get round this problem, a contract was drawn up stating that secrecy would be maintained and that the company had the final and unquestionable right to exclude any data gathered that they felt would compromise their business interests. This worked well in all the companies, though there was wide variation in personal attitudes towards secrecy amongst the business people dealt with. These ranged from near paranoia to one case where a manager threw open large filing cabinets and literally said ‘Help yourself’. Most cases, though, fell somewhere in between these two extremes.


Copyright and ethics: These two related issues are important to any creator of corpora. The problems presented by gathering copyright permission are well recognised in the literature (Renouf 1987, Sinclair 1991, Leech 1992, Kennedy 1998). In the case of the PMC the process was relatively straightforward. Once the books for inclusion in the corpus were known, a letter asking for permission was sent to each publisher with a list of the books involved. All the publishers approached responded promptly and affirmatively.


The BEC presented a more difficult problem, owing to the vast number of different texts involved. At present, permission has been granted for all the spoken and written ‘doing’ sections of the corpus, and work on gaining permission for the texts in the public domain is ongoing.


It has also been necessary to uphold high ethical standards in collecting data in companies. The problems of secrecy have already been noted, and maintaining participants’ anonymity has been paramount to the study. The question of ethics has only been problematic in one instance - the gathering of telephone calls. This is a two-sided issue: everyone should have the right to know they are being recorded, but at the same time, in a business situation, it would be impractical to get permission from every caller. A compromise was reached. The phone conversations included in the study were recorded by the participants themselves - the researcher was not present during any of them. The participants thus knew they were being recorded; the people they spoke to on the phone, however, did not. To counter this problem, when the tapes were transcribed, all personal information of any kind was removed from the transcription - thus no person could ever be recognised from the data. It is recognised that an ethical problem still remains here, yet it was considered vital to get telephone calls into the corpus as they represent a major part of business activity. The solution hopefully goes far enough in dispelling major problems of ethics, and the respondents’ identities will remain permanently hidden.


The final part of this chapter refers to the third corpus used in this research - this time not created for the project, but chosen to act as a reference to the BEC and the PMC.


6.8 The reference corpus


It was discussed in Chapters 1 and 2, and will be seen again in the next chapter, that in order to determine statistically how the lexis found in the BEC differs from that of general English, a ‘reference corpus’ needs to be used for comparison. For this purpose the British National Corpus (BNC) Sampler Corpus of 2 million words has been used in this research. The population of ‘general English’ is therefore defined as modern British English with a 50-50 split of spoken and written language. The BNC Sampler was chosen because its division of spoken and written language is akin to that of the BEC, and its relatively small size simplifies and speeds up analysis. Despite its small size, however, it is a representative sample of general British English, specifically chosen as such by its creators from the much larger full BNC.
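Such comparisons rest on a keyness statistic, and Dunning’s log-likelihood (G2) is one measure commonly used for the purpose (it is, for instance, implemented in WordSmith Tools). The sketch below, with invented frequencies rather than figures from the BEC or BNC Sampler, shows how the score for a single word could be computed:

```python
import math

def log_likelihood(a, b, corpus1_size, corpus2_size):
    """Dunning's log-likelihood (G2) for one word.

    a = frequency of the word in the study corpus,
    b = frequency of the word in the reference corpus.
    """
    # Expected frequencies under the null hypothesis that the word is
    # proportionally equally common in both corpora.
    total = corpus1_size + corpus2_size
    e1 = corpus1_size * (a + b) / total
    e2 = corpus2_size * (a + b) / total
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2

# Hypothetical figures: a word occurring 500 times in a 1m-word study
# corpus but only 100 times in a 2m-word reference corpus.
score = log_likelihood(500, 100, 1_000_000, 2_000_000)
```

A score above 3.84 indicates a difference significant at p < 0.05 on the usual chi-square interpretation; the invented figures above produce a score of roughly 639, i.e. an overwhelmingly ‘key’ word.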


6.9 Discussion and rationale


The last four chapters have placed Business English, lexis and corpora in their appropriate methodological backgrounds and have shown how the corpora have been gathered and organised. Underlying these chapters has been a logical rationale for the thesis, springing from the initial presentation of the hypotheses in Chapters 1 and 2.


·    The review of Business English and Business English published materials indicated the need for the creation of corpora in order to examine the field in an empirical rather than intuitive manner.

·    The focus on Business English lexis, which this thesis takes, required that not only single words were studied, but also the manner in which words behave in relation to others, both in terms of collocation and in terms of the formation of longer chunks or clusters of words.

·    Collocation could then be further expanded and organised by using the concept of semantic prosody and associates. 

·    The fact that lexis and grammar combine to create meaning further necessitated the inclusion of colligation in the study.


Thus, each aspect of the study is a direct result of the previous step, going back to the original hypotheses. The next chapter, therefore, re-presents these hypotheses, along with the research questions and method used to carry out this thesis.


[1] This is based on what is known as Zipf’s law of word distribution (Zipf 1935). This states (to paraphrase very simply) that most words are used very rarely, while a small number of the most common words make up the majority of communication.

See http://sun1.bham.ac.uk/G.Landini/evmt/zipf.htm for an interesting discussion on Zipf and his ideas.
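The skewed distribution Zipf describes is visible even in a toy example: counting the word frequencies of any short stretch of text shows a few very common types alongside a long tail of words occurring only once (hapax legomena). A minimal illustration, using an invented sentence in place of a real corpus, where the effect is of course far more pronounced:

```python
from collections import Counter

# A toy token list standing in for a corpus.
tokens = ("the cat sat on the mat and the dog sat by the door "
          "while a bird sang").split()
freqs = Counter(tokens)

# Most word types occur only once (hapax legomena) ...
hapaxes = [w for w, n in freqs.items() if n == 1]

# ... while a handful of very common words dominate the running text.
top_word, top_count = freqs.most_common(1)[0]
```

Here eleven of the thirteen word types are hapaxes, while ‘the’ alone accounts for four of the seventeen running words - the pattern that underlies Sinclair’s argument for large corpora.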

[2] This may have been true in 1991, but the availability of computer programs for corpus analysis has greatly improved in the last decade, for example SARA, WordSmith Tools and MonoConc Pro, to name but three readily available corpus analytical programs.

[3] Part of a message to the Corpora discussion group on the Internet at corpora@hd.uib.no - 9.10.97


[4] Survey of English Usage

[5] Guangzhou Training College of the Chinese Petroleum University, Guangzhou, China

[6] Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong

[7] The Århus School of Business, Århus, Denmark

[8] Part of a message to the Corpora discussion group on the Internet at corpora@hd.uib.no - 9.10.97.

[9] Part of a message to the Corpora discussion group on the Internet at corpora@hd.uib.no - 9.10.97

[10] There are some minor exceptions to this. In one file - Train1.txt - a training session was recorded in the UK in the field of building and construction. The trainer was English, but the audience of four men was made up of non-native speakers. The vast majority of speech recorded was that of the trainer, with occasional questions from the audience. The small contribution made by the non-native speakers was edited after the event to iron out any obvious grammatical errors, and this was subsequently checked with them by interview. Additionally, there are small amounts of non-native speaker speech in one file of BBC radio business broadcasts. These also may have been edited in transcription by the BBC, as no ‘mistakes’ in the language are apparent. It must be stressed that these represent a very small amount of language and, it is argued, do not affect the integrity of the corpus as a whole. No results are based on any of these individual files alone.

[11] UK/US English: The two main sources of native speaker English speech in the world are the UK and the US. This study is primarily concerned with the Business English used in the UK. Yet it is not possible to exclude US Business English - US English is very pervasive in UK society as a whole, and there is a great deal of trade done in the UK with US companies. However, the precise amount of US English a business person would be exposed to is subject to many variables, notably whether or not the company in the UK does actual business with US companies. It was, therefore, decided to include US Business English in the corpus, and it represents 21.4% of the total corpus. It should be stated that for the purpose of this thesis, UK/US differences are not an issue, and US Business English is simply included because it is a natural part of any native speaker’s business experience to a greater or lesser degree.


[12] Female speech is denoted here as all cases where the text is either purely of female origin, or where at least one of the interactants in the text was female.

[13] Examples include Meeting 4, where two MDs, two junior sales managers and one senior sales manager are present; Meeting 7, where the head of the engineering department and two engineers are present; and Meeting 11, where a senior partner in a finance company, a senior manager and a secretary are all present.

[14] de Haan goes on to say of 20,000-word samples that ‘even they are too small’ (1992:3), and that whether or not the sample sizes are large enough depends on what the corpus is trying to achieve.

[15] Part of a message to the Corpora discussion group on the Internet at corpora@hd.uib.no - 9.10.97



[16] Here, when the phrase ‘talk about’ business is used, it also covers written language that is about business but is not involved in the actual ‘doing’ of it.

[17] As was noted earlier this is the only section of the corpus where short samples were used.

[18] This section and others fell below the 20,000-word sample size but were included because of the quality of data they produced in spite of their small size. A fuller discussion of this matter is found in Chapter 9 and Appendix 12.

[19] This was chosen simply as being ten years back from the year (1996) when data collection took place.


[20] 1. Bournemouth English Book Centre  2. English Book Centre, Oxford  3. Brighton English Book Centre  4. Dillon’s EFL Department, London  5. Keltic ELT Bookshop, London


[21] The actual city where the chamber was situated must be kept secret as part of the confidentiality agreement reached with them as a pre-condition for their help in the project.

[22] Despite an initial personal contact in this company, it took one year, many phone calls and two meetings to gain access. It was possible in the end only with the very kind help of a senior partner whom I had not known previously. He personally pushed the project through: firstly with his own boss, then with local risk management people who were very worried about company liability if anything ‘got out’, and finally with HQ lawyers, who gave permission with certain provisos, e.g. that no client’s name be mentioned on the tapes. To him, who must remain anonymous, once again my very great thanks for all his hard work.

[23] This was a long and complicated process where the Sales Manager agreed to help and was very keen to do so, but the MD refused - despite all assurances that their business would not in any way be compromised.

[24] This will be discussed in more detail in the ‘Transcription’ section below.

[25] Crowdy (1994) reports a transcription rate of 750 words per hour, but in the case of the BEC the transcription rate varied dramatically according to the quality of the tape.

[26] The tapes had been recorded by the participants themselves. The researcher was not present at the meetings as it was thought this might affect the language used.

[27] The only annotation used is to assign labels to the speakers in the recorded speech events so subsequent researchers can follow who is talking.

[28] Counted using the MS Word 6 word counter. This statistic differs from that of, for example, WordSmith, as the programs have different criteria for defining what a ‘word’ is. Thus, for the purposes of the database, the number of words for each text is taken from Word 6 statistics. It was also found that different versions of WordSmith counted the corpus differently - for example, WordSmith 2 added approximately 10,000 words to the size of the BEC compared to WordSmith 3. All figures in this thesis are from WordSmith 3.
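The effect of differing definitions of a ‘word’ is easy to demonstrate. The two rules below are illustrative only - they are not the actual algorithms used by Word or WordSmith - but they show how counting whitespace-delimited tokens and counting alphabetic runs give different totals for the same invented line:

```python
import re

text = "The MD's e-mail re: year-end figures wasn't sent."

# Rule 1: a 'word' is any run of non-whitespace characters
# (roughly how a word processor might count).
count_whitespace = len(text.split())

# Rule 2: a 'word' is a run of letters, so hyphens and apostrophes
# split tokens (one plausible concordancer-style definition).
count_alpha = len(re.findall(r"[A-Za-z]+", text))
```

Rule 1 counts 8 words here, Rule 2 counts 12; scaled up to a million-word corpus, discrepancies of this kind easily account for differences of the order of 10,000 words between programs.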