Reliability of Quick Placement Tests

How much faith can we place in quick paper or internet based placement tests?

Michael Berthold
University of Southern Queensland

Abstract

The almost unquestioning faith that many administrators have in the reliability and accuracy of quick paper or internet based placement language tests should be of concern to practising language teachers. Recent trends in the use of these tests suggest a strong desire to produce a number or score that will instantly place a student in a pigeon hole, namely an ESL class, for administrative convenience. What this simplistic focus neglects is that the national identity of students, the linguistic distance between their first language and English, their educational background, their familiarity with computers and their level of English will have a profound effect on the reliability of these tests - points that this paper will develop and discuss. This paper presents two different analyses to challenge the belief that quick placement tests are reliable indicators of students' proficiency levels: first, two types of quick placement tests (one paper and one online) are compared and their reliability tested one against the other; second, the reliability of an online test is compared with face-to-face testing performed on campus. The unquestioning reliance on the accuracy of these tests, and the resulting potential misplacement of students in inappropriate class levels, has a profound effect upon the learning experience of the students and their ultimate levels of achievement, not to mention the stress placed upon teachers who have students allocated to their classes who are obviously either lost or bored.

Key words: placement, tests, reliability, accuracy, ESL
I have been concerned for some time about the almost unquestioning belief that many administrators have in the reliability and accuracy of quick paper or internet based placement tests. This concern is shared by others; the following comment was made with respect to exit English skills, but the concern is similar: 'The sector's blind faith in language testing inhibits the development of more robust ways of addressing English language outcomes for graduates' (Arkoudis, 2011). There seems to be a desire to have a number or score that will instantly place a student in a pigeon hole for administrative convenience. I often receive requests for placement advice based on some type of 'quickie' test, and sometimes find that the person requesting the information is quite peeved when I refuse to give a definitive answer based upon dubious evidence from a test that purports to be an effective indicator of a person's level of English on the strength of a tick-a-box exercise, without a writing or speaking component.

Most of these tests consist mainly of ticking boxes, therefore an element of luck is involved. One student might hesitate between two answers, give a great deal of thought to the choice, and still make the incorrect one. On the other hand, another student might have no idea, gleefully tick boxes at random, and fluke a relatively high score. I decided to put this idea to the test. I found a site offering online language tests for a variety of languages and tested myself in English (native speaker), French (fluent speaker) and several other languages. The results were interesting.
Table 1: My scores on eight language tests performed using multiple choice questions.

Language   Score (%)   Level specified      Comments
English    97          Intermediate         Native speaker
French     83          Advanced beginner    Proficient
Spanish    39          Beginner             Non-speaker
Chinese    30          Beginner             Non-speaker
Russian    23          Starting out         Non-speaker
Irish      23          Starting out         Non-speaker
Swedish    23          Starting out         Answered A to all questions
Japanese   5           Starting out         Non-speaker

Several issues concerned me with this type of test. How could I, as a proficient native speaker, be judged as only intermediate because I had one answer wrong? Incidentally, that error was an Americanism and was technically incorrect English. I have the same complaint about the judgement of my level of French. I freely acknowledge that I make some errors of grammar, but to be classified as an advanced beginner is way off the mark, as I obtained the equivalent of a master's degree in Applied Linguistics at a French university, albeit more than twenty years ago. I was quite impressed with my scores for Spanish and Chinese, neither of which I speak in any way, shape or form. My greatest disappointment was with Japanese. It was also a remarkable coincidence that I scored 23% on three different language tests.
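The role that luck plays in a tick-a-box test is easy to illustrate with a short simulation. The sketch below is a hypothetical illustration, not a model of any particular commercial test: it assumes a 40 question test with four options per question and a 'student' ticking boxes entirely at random.

```python
import random

def random_guess_scores(num_questions=40, num_options=4, trials=10000, seed=1):
    """Simulate many students ticking boxes at random on a multiple
    choice test; report the lowest, mean and highest percentage scores."""
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        # A guess is 'correct' with probability 1/num_options.
        correct = sum(rng.randrange(num_options) == 0
                      for _ in range(num_questions))
        scores.append(100 * correct / num_questions)
    return min(scores), sum(scores) / trials, max(scores)

low, mean, high = random_guess_scores()
print(f"min {low:.1f}%, mean {mean:.1f}%, max {high:.1f}%")
```

With four options, pure guessing averages around 25%, but individual attempts scatter widely around that mean, which is exactly how a non-speaker can fluke a Beginner or even higher rating on such a test.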
The bottom line here is that students with a low level of English who answer every question on a multiple choice questionnaire can obtain results of extremely low accuracy and reliability. Guessing can give a very unbalanced result and in no way indicates the actual level of the student: in this mini experiment my scores ranged from 5% to 39% on languages of which I have no knowledge. If we were to rely solely on paper based multiple choice questions as placement tests for overseas students then this level of error, or even greater, would exist.

I cross-checked my scores with another website (Oxford University Language Centre) that presented placement tests with fifty multiple choice questions, and this time I scored much better. On the French test I was one answer short of scoring 'advanced', the highest rating, and with Spanish, which I guessed, I was only two points off being assessed as intermediate. These are quite different scores from the previous online multiple choice placement tests, and just as concerning: there is no way that I should have achieved such a high rating for Spanish with my nonexistent knowledge.

General concerns

Supervised or unsupervised tests?

If students are not supervised while they are doing the tests then this leaves it wide open to cheating. Students may have friends with them who have a greater knowledge of the language, and they may have grammar books or dictionaries with them, especially these days with sophisticated electronic dictionaries. If the test is unsupervised then we have no idea how
long the student spent on each test unless it is electronic with a timer installed.

Who is actually sitting for the test?

If we are using paper or online tests we have no idea who is actually doing the test if it is unsupervised, and in some instances even if it is supervised. I often receive an answer sheet with a student's name at the top. I have no idea who this person is, nor whether they actually did the test themselves. Whilst in our naivety we expect students to do the test, in some other cultures the skill is in getting a high result, regardless of the methods used - the end justifies the means.

Different culturally based learning styles, e.g. Chinese versus Arabic

In general, the Confucian philosophy which permeates many Asian cultures results in students depending very much on rote learning of grammar and vocabulary. Conversely, Islamic cultures have a very strong oral tradition which is reflected in their language learning. These two examples lead to the generalisation that Asian students are more likely to score highly on grammar and writing based tasks/tests, whereas Middle-Eastern students are more likely to score well on speaking and listening. I realise that this is a generalisation and that there will be many exceptions; however, as a rule of thumb it does apply. So, what effect does this have on interpreting scores based upon multiple choice questions with a strong grammatical bias? I have found that with 187 students whom I tested with the Oxford Online Placement Test, followed up with an interview and
writing and reading tests, in most instances Middle-Eastern students scored higher, and were subsequently placed in higher level classes, than their online test results would have suggested. This is discussed more fully later.

Paper based tests

Prior access to the test?

In some cases, especially with the Oxford paper based test, there are many copies available as it can be photocopied or emailed, and the security of the integrity of the exam is virtually nil. I have had agents in other countries not only give the paper based test to their clients, but also mark it and send the scores back to me. No security equals no reliability or validity.

Access to the answer sheet?

Following from the lack of security of the paper is the lack of security of the answer sheet. If the tests are so freely available, one would be naive in the extreme not to believe that there are answer sheets floating around as well. I had one case where two brothers did a test. Out of sixty answers there was only one mark of difference in their scores. Out of curiosity, I examined their answer sheets in detail. Not only were their results very similar: all of their answers (multiple choice questions) were identical except for one. I would love to know what the statistical probability of this happening is - there is far more chance of winning Lotto. The agent assured me that the prospective students did the test, supervised in his office, and that any similarity of results was sheer coincidence!
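The odds of such agreement arising by chance can be sketched with a simple binomial calculation. Assuming, purely for illustration, that both brothers guessed independently and uniformly among four options (so any one question matches with probability 1/4), the probability of their answers agreeing on at least 59 of 60 questions is:

```python
from math import comb

def p_agree_at_least(k, n=60, p_match=0.25):
    """Binomial tail: probability that two independent uniform guessers
    give the same answer on at least k of n four-option questions."""
    return sum(comb(n, i) * p_match**i * (1 - p_match)**(n - i)
               for i in range(k, n + 1))

print(p_agree_at_least(59))  # of the order of 1e-34
```

Of course, two students who genuinely know some English will match more often than random guessers, so the per-question match probability would be higher in reality; but even under generous assumptions the odds remain minuscule next to the roughly one-in-millions chance of winning Lotto.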
Comparison of results obtained by various testing methods

I have for quite some time been sceptical of the reliability of the Oxford pen and paper Quick Placement Test. This test comprises 40 multiple choice questions (60 for more advanced students) and is claimed to be accurate in placing students in a CEF (Common European Framework) band (see Table 2). Recently I have been using Oxford's online version of this test, which does appear to remove some of the deficiencies of the paper based test and also includes a listening component. This test is adaptive: it chooses questions of enhanced or reduced difficulty depending on the examinee's result on the preceding question. That is, if the student gets a question wrong, then a simpler question or one of similar difficulty will be chosen next; conversely, if the answer is correct then the next question will be of similar or higher difficulty. The computer not only counts the number of correct responses but also weights them by their difficulty to give an overall mark and CEF rating. This also means that every student does a different, individualised test, thereby eliminating the possibility of obtaining a copy of the test in advance and learning/memorising the answers.

Table 2: Common European Framework (CEF) descriptors. Source of table unknown.

B1
Descriptor: Can understand the main points of clear standard input on familiar matters regularly encountered in work, school, leisure, etc. Can deal with most situations likely to arise whilst travelling in an area where the language is spoken. Can produce simple connected text on topics which are familiar or of personal interest. Can describe experiences and events, dreams, hopes & ambitions and briefly give reasons and explanations for opinions and plans.
What does this mean? Insufficient English for full academic level participation in language activities. A student could get by in everyday situations independently. To be successful in communication in university settings, additional English would be required.

B2
Descriptor: Can understand the main ideas of complex text on both concrete and abstract topics, including technical discussions in his/her field of specialisation. Can interact with a degree of fluency and spontaneity that makes regular interaction with native speakers quite possible without strain for either party. Can produce clear, detailed text on a wide range of subjects and explain a viewpoint on a topical issue giving the advantages and disadvantages of various options.
What does this mean? Minimum requirement for undergraduate entry.

C1
Descriptor: Can understand a wide range of demanding, longer texts, and recognise implicit meaning. Can express him/herself fluently and spontaneously without much obvious searching for expressions. Can use language flexibly and effectively for social, academic and professional purposes. Can produce clear, well-structured, detailed text on complex subjects, showing controlled use of organisational patterns, connectors and cohesive devices.
What does this mean? A level at which a student can comfortably participate in all postgraduate activities including teaching. Most international students who enter university at B2 level would acquire a level close to C1 after living in the country for several years and actively participating in all language activities encountered at university.

C2
Descriptor: Can understand with ease virtually everything heard or read. Can summarise information from different spoken and written sources, reconstructing arguments and accounts in a coherent presentation. Can express him/herself spontaneously, very fluently and precisely, differentiating finer shades of meaning even in more complex situations.
What does this mean? This is a highly proficient level and a student at this level would be extremely comfortable engaging in academic activities at all levels.

Case Study 1: Chinese students from the Faculty of Business

At the beginning of Semester 2, 2011, twenty students from a university in China were accepted into the Faculty of Business at USQ to do a joint programme in the Bachelor of Commerce (General). These students had satisfied the English requirement set by the Faculty in its arrangement with its Chinese counterpart; that is, they all had a minimum IELTS score of 6.0 or better. This joint programme has been operating for a number of years and the students have achieved remarkably high results in the subjects studied here. Last year we (Dr Joseph Mula and I) tested the incoming students on their English levels using two types of tests developed by Oxford and Cambridge Universities and their Local Examinations Syndicate. We were interested to see:
- how each of the tests rated the students according to the CEF, and
- what correlation there was between the two tests.
All students sat the paper test, which was supervised by us. Unfortunately, with the online test there was a series of technical/internet based problems which caused us to postpone the test twice. As we were relying on the voluntary participation of the students, it was understandable that some of them became frustrated by turning up to a test that did not function and was wasting their time. Consequently only 13 students completed both tests. The advantage in using this group for our study was that we eliminated
many of the possible variables, namely:
- All students were of the same ethnic group.
- All students had a similar educational background.
- All students had passed tests in China attesting to a somewhat similar level of English, sufficient for them to be accepted into second year of the same faculty.
- All students were supervised during both tests; there was no opportunity to copy answers from one another.

The results were analysed in three ways, described and summarised in the following tables and figures. The first and most simplistic method was to look at how many students scored at each CEF level.

Table 3: Results of thirteen Chinese students tested on both the Oxford Quick Placement Test (paper based) and the Use of English component of the Oxford Online Test.

Level   Paper based   Online (Use of English)
C1      6             2
B2      6             5
B1      1             6

Taking this at face value, using only the paper test we would have one student who was not linguistically prepared for undergraduate studies, but according to the online assessment there would be six students in this category - a significant difference.
The second method was to compare the CEF ratings on the two tests for each student. The results are presented in Table 4, where the lack of a strong correlation can be seen. There are a number of striking differences:
a. Two students who scored C1 (suited to postgraduate studies) on the paper test rated only a B1 on the computer based test (more English needed before entry to faculty; they would need to enrol in EAP).
b. Only two students scored the same rating on the two tests: student 1 (C1) and student 8 (B2).

Table 4: Individual students' ratings on the CEF scale according to the two tests.

Student:         1   2   3   4   5   6   7   8   9   10  11  12  13
QPT (paper):     C1  C1  C1  C1  C1  C1  B2  B2  B2  B2  B2  B2  B1
Use of English:  C1  B2  B2  B2  B1  B1  C1  B2  B1  B1  B1  B1  B2

The third method was to rank each of the thirteen students from highest to lowest within the group on each test. I hoped to show by this method that there was little consistency in individual students' results, rather than simply comparing their CEF scores. The individual differences here are quite clearly seen. Only two students ranked in the same position in both tests, whereas the student who scored the worst on the paper test (B1) ranked third (high B2) on the online
test, and conversely the lowest rated student on the online test (B1) scored fourth place (C1) on the paper test.

Table 5: The rank order of individual students according to the two tests they completed (QPT paper versus Use of English online).

It was useful to see what type of distribution there was when I used a scatter plot to try to identify any trends in the data. Figure 1 shows the distribution of the ranked scores. As can be seen, there is no defined trend; the rankings are quite scattered.

Figure 1: A plot of students' rankings in the paper based test compared to their rankings in the online test.
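The degree of (dis)agreement between two rankings of the same students can be quantified with Spearman's rank correlation coefficient. The sketch below uses invented rankings for thirteen students - not the actual study data - purely to show the calculation; a coefficient near 1 would mean the two tests order the students almost identically, while a value near 0 means no association between the orderings.

```python
def spearman_rho(ranks_a, ranks_b):
    """Spearman's rank correlation for two tie-free rankings:
    rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(ranks_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d_squared / (n * (n**2 - 1))

# Hypothetical rankings of 13 students on the paper and online tests:
paper_rank = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
online_rank = [4, 7, 1, 13, 2, 10, 5, 8, 12, 3, 11, 6, 9]
print(round(spearman_rho(paper_rank, online_rank), 2))  # 0.29
```

A value this close to zero would confirm what the scatter plot suggests visually: knowing a student's rank on one test tells us very little about their rank on the other.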
I also plotted the raw scores for the two tests for each student (Figure 2). This also shows the lack of correlation; we would be hard pressed to draw a line of best fit for either Figure 1 or Figure 2. Vertical? Horizontal? Upward sloping? You will notice that there are only 12 points shown for 13 students, as two students scored the same results on both tests (48 and 60).

Figure 2: A comparison of the students' raw scores (numerical) in each test (paper test versus online Use of English, Oxford Placement Tests).
Possible factors for differences

At this early stage of the analysis we can only speculate on what may have caused such discrepancies. Reasons could be as diverse as:
- Individual preferences for paper or internet based assessment.
- The 30 minute time limit may have affected some students on the paper test, whereas being given sufficient time (a maximum of 80 minutes) on the online test (Use of English and Listening) could have reduced or removed one stress factor.
- The questions posed are not really reliably testing the students' proficiency level in English.

Case Study 2: On-campus overseas students placement tested using the Oxford Online Placement Test plus interview and writing and reading tests

When we have requests from students living overseas and use the Oxford Online Placement Test (OOPT), we do not test them through interview or writing test. The promoters of the Oxford online test claim that theirs is a valid test for class placement as it tests two elements: Use of English, and Listening.

The OOPT has been designed, in the theory that underlies its items and in every aspect of the measuring system, to serve one primary purpose: the accurate placement of students learning English into a class or a course of teaching that is appropriate to their current needs. Anyone using it for this
purpose can be confident that it is as valid for that use as possible. (Pollitt, p.13)

Intuitively I found that claim to be exaggerated, as students learn or acquire their foreign language in a variety of ways, and a test of only Use of English and Listening that claims to slot them into a category struck me as overly simplistic. Therefore, when we do placement tests for on-campus overseas students, where possible I have them do the OOPT and supplement it with a face-to-face interview and reading and writing tests. The writing test comprises two elements:

BICS (Basic Interpersonal Communication Skills): day to day language used in social encounters in a community - shopping, chatting with people at parties, playing sports, organising accommodation, talking on the telephone, writing about your country or family, etc.

CALP (Cognitive Academic Language Proficiency): language used in academic settings where the skills of listening, speaking, reading and writing need to be at higher than functional levels. For example, students would need to be able to compare and contrast, present and support an argument or opinion, analyse texts to find the main points, and write coherent and well formed sentences/paragraphs/essays.

By using these two styles of writing tests I can judge more accurately which ELICOS class to place the students in, from Level 1 through to EAPII. However, as Alastair Pollitt admits:
Measurement can never be perfect, and when we try to measure something both unobservable and as elusive as language proficiency there will always be a significant amount of uncertainty associated with the scores obtained. Some of this uncertainty comes from variability in the students: their alertness, concentration, or motivation can fluctuate from day to day, causing similar fluctuations in their scores, and there is not much that any test can do to overcome this effect. The other main source of uncertainty with test results comes from the particular set of items that a particular student meets; would they have done better (or worse) if they had met different items? The OOPT uses an adaptive testing methodology that is much better at reducing this problem than any fixed test can be. (Pollitt, pp.9-10)

I agree that the adaptive testing methodology appears to be far more effective than a paper based test, but it is not the whole solution to the problem of placement testing.

Comparison of OOPT with actual placement of students into classes

For the past two years I have been placing students into different levels of ELICOS and EAP using the above methods, that is: interview, reading, writing and OOPT. I have tested 187 on-campus students during this time and have found a degree of difference between the results predicted by the OOPT and those measured in interview and writing. ELICOS classes include Levels 1-5, and I have called EAPI and EAPII Levels 6 and 7 for statistical convenience.
Figure 3: OOPT 'Overall' scores compared with actual class placement for 187 on-campus overseas students tested in the Open Access College.

As can be seen in Figure 3, each class level has a range of OOPT results, and therefore a score in any OOPT band can correspond to students in a number of class levels. For example, the band of 1-20 marks (A1) theoretically corresponds to Levels 1, 2 or 3 in our ELICOS programme, but in reality we have had students in that band placed anywhere from Level 1 to EAPI. The following table explains in more detail the discrepancies between OOPT placements and face-to-face evaluation.
Table 6: OOPT Overall scores and their CEF ratings, comparing theoretical placement of students into various levels to the actual placements after speaking, listening and writing are taken into account.

CEF   Theoretical placement       Actual placement
A1    Levels 1, 2, 3              Levels 1, 2, 3, 4, 5, EAPI
A2    Levels 4, 5                 Levels 3, 4, 5, EAPI & EAPII
B1    Levels 6 (EAPI), 7 (EAPII)  Levels 4, 5, EAPI, EAPII, direct entry to faculty
B2    Direct entry to faculty     EAPI, EAPII, direct entry to faculty
C1    Direct entry to faculty     Direct entry to faculty
C2    Direct entry to faculty     Direct entry to faculty

Figure 3 shows a definite trend, a correlation between scores and class placements: the higher the score, the higher the class. But it is far from a definitive predictor of a student's linguistic competence in all four skill areas, as there are numerous students who do not fit into this neat categorisation. For example, there was one student who rated only a B1 but who was recommended for direct entry into faculty. This student had already successfully completed subjects in his specialisation at another Australian
institution, wrote and spoke quite fluently, but did not find the type of questions asked in the OOPT to be something he was familiar with.

There is also a problem with the averaging out of the OOPT scores for the two tests. Students might have quite diverse sub-scores yet have an identical mark overall. Table 7 shows some of these inconsistencies within the range of A2.

Table 7: Two examples of variations in the OOPT scores that gave the same overall mark, and how the students were subsequently placed after interviews and writing tests. All students were placed in the CEF A2 band by the OOPT. (Columns for each example: Use of English, Listening, Actual placement; the actual placements ranged from EAPI to EAPII.)
The table also highlights a number of anomalies, such as the variance in the scores for Use of English and Listening. The most glaring example is scores of 51 and 22 giving an overall of 36 for one student, whereas two other students scored 36 and 36 to give an overall of 36, yet they were allocated to different class levels. The student who scored 51 and 22 was from the Congo and had had some schooling in Australia; he was therefore quite comfortable with how English is used, but had a great deal of difficulty in understanding the British voices on the Listening test. It may seem anomalous that a student with an overall score of 21 should be allowed into Level 5. This student had, however, an IELTS test result of 4.5 from his own country, which gave him automatic entry into Level 5. That, however, is another issue too large to be considered here.

This same type of inconsistent result often occurred with students from our refugee community. The use of the internet and computers was not something they were au fait with, and the style of questions did not relate at all to the way in which they had acquired English. On the other hand, some of our Asian students scored highly on the pick-a-box style of questions, but did not perform well at all at the interview stage. The probability of relatively low scores by Middle-Eastern students on the OOPT compared to their actual performance in face-to-face testing has already been mentioned with respect to their cultural mode of instruction/learning. This in effect accounts for many of our anomalies, as many of our students are of Middle-Eastern origin.
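The masking effect of the overall mark is easy to see if one assumes, purely for illustration, that the overall score is the truncated mean of the two sub-scores; the actual OOPT weighting is not published in the material cited here. Very different profiles then collapse to the same number (the first two pairs below are the 51/22 and 36/36 examples discussed above; the third is invented):

```python
def overall(use_of_english, listening):
    # Illustrative assumption only: overall = truncated mean of the
    # two sub-scores. The real OOPT formula may differ.
    return int((use_of_english + listening) / 2)

# Three sub-score profiles that all collapse to an overall of 36:
profiles = [(51, 22), (36, 36), (30, 42)]
for uoe, lis in profiles:
    print(uoe, lis, "->", overall(uoe, lis))
```

Whatever the exact formula, any single aggregate of two sub-scores must discard the very information - the gap between Use of English and Listening - that turned out to matter most for placing these students.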
There is also the problem of students scoring very differently in the various skills tested. If we are only to trust the OOPT, then which of the three scores do we trust: (1) Overall, (2) Use of English, or (3) Listening? When students score very similar results in the two parts of the OOPT, they are generally also relatively consistent in their speaking and writing skills. However, disparate OOPT scores are often also reflected in differences in their face-to-face testing.

Summary

Over the past two years the following points have become obvious, and are supported by an analysis of placement test results:
- Paper based tests are unreliable.
- Unsupervised tests are unreliable.
- Supervised tests are unreliable if we cannot have faith in the integrity of the supervising authorities.
- Multiple choice tests are unreliable.
- Results from adaptive online tests are definitely superior to those from fixed tests (paper or online) and are an indicator of some skill levels, but need to be supplemented by face-to-face testing.

Therefore, if you are presented with a number with respect to a student's English proficiency level which claims to be definitive and accurate, be suspicious. Do not have blind faith in numbers produced by any type of exam: at best they are a good indicator of student performance, at worst they are useless.
REFERENCES

Arkoudis, S. (2011, 12 October). English language development faces some testing challenges. The Australian, Higher Education, p. 33.

Pollitt, A. The Oxford Online Placement Test: The Meaning of OOPT Scores. Retrieved from

Oxford University Language Centre: Online Placement Tests. Retrieved from

Transparent Language: Language Proficiency Tests. Retrieved from