
Testing, assessment and evaluation


Gentle warning: this is a complex area which is littered (some might say infested) with terminology.  You may like to take it a bit at a time.



The changing face of testing: a little history

Over the years, what we test and how we test it in our profession have seen great changes.  Here are three citations that show how.

  • The more ambitious we are in testing the communicative competence of a learner, the more administratively costly, subjective and unreliable the results are.
    (Corder, 1973: 364)
  • The measurement of communicative proficiency is a job worth doing, and the task is ultimately a feasible one.
    (Morrow, in Alderson JC, Hughes, A (Eds.), 1979: 14)
  • It is assumed in this book that it is usually communicative ability which we want to test.
    (Hughes, 1989: 19)

In what follows, it is not assumed that it is always communicative ability which we want to test but that's usually the case and definitely the way to bet.



Defining terms

Why the triple title?  Why testing and assessment and evaluation?  Well, the terms are different and they mean different things to different people.

If you ask Google to define 'assess', it returns "evaluate or estimate the nature, ability, or quality of".
If you then ask it to define 'evaluate', it returns "form an idea of the amount, number, or value of; assess".
The meaning of the verbs is, therefore, pretty much the same but they are used in English Language Teaching in subtly different ways.

When we are talking about giving people a test and recording scores etc., we would normally refer to this as an assessment procedure.
If, on the other hand, we are talking about looking back over a course or a lesson and deciding what went well, what was learnt and how people responded, we would prefer the term 'evaluate' as it seems to describe a wider variety of data input (testing, but also talking to people, recording impressions and so on).  Evaluation doesn't have to be very elaborate.  The term could cover anything from nodding to accept an answer in class to formal examinations set by international testing bodies, but at that end of the cline we are more likely to talk about assessment and examining.

Another difference in use is that when we measure success for ourselves (as in teaching a lesson) we are conducting evaluation; when someone else does it, it's called assessment.

In what follows, therefore, the terms are used to mean much the same thing, but the choice of term will be made to suit what we are discussing.

How about 'testing'?  In this guide 'testing' is seen as a form of assessment but, as we shall see, testing comes in all shapes and sizes.  Look at it this way:

[diagram: testing sits between evaluation and assessment]
As you see, testing sits uncomfortably between evaluation and assessment.  If testing is informal and classroom based, it forms part of evaluation.  A bi-weekly progress test is part of evaluation although learners may see it as assessment.  When testing is formal and externally administered, it's usually called examining.
Testing can be anything in between.  For example, an institution's end-of-course test is formal testing (not examining) and a concept-check question to see if a learner has grasped a point is informal testing and part of evaluating the learning process in a lesson.
Try a short matching test on this area.  It doesn't matter too much whether you get all the answers right.



Why evaluate, assess or test?

It's not enough to be clear about what you want people to learn and to design a teaching programme to achieve the objectives.  We must also have some way of knowing whether the objectives have been achieved.
That's called testing.



Types of evaluation, assessment and testing

We need to get this clear before we can look at the area in any detail.

Initial vs. Formative vs. Summative evaluation
Initial testing is often one of two things in ELT: a diagnostic test to help formulate a syllabus and course plan or a placement test to put learners into the right class for their level.
Formative testing is used to enhance and adapt the learning programme.  Such tests help both teachers and learners to see what has been learned and how well, and to set targets.  It has been called educational testing.  Formative evaluation may refer to adjusting the programme or to helping people see where they are.  In other words, it may be targeted at teaching or learning (or both).
Summative tests, on the other hand, seek to measure how well a set of learning objectives has been achieved at the end of a period of instruction.
Robert Stake describes the difference this way: "When the cook tastes the soup, that's formative.  When the guests taste the soup, that's summative" (cited in Scriven, 1991: 169).
Informal vs. Formal evaluation
Formal evaluation usually implies some kind of written document (although it may be an oral test) and some kind of scoring system.  It could be a written test, an interview, an on-line test, a piece of homework or a number of other things.
Informal evaluation may include some kind of document but there's unlikely to be a scoring system as such and evaluation might include, for example, simply observing the learner(s), listening to them and responding, giving them checklists, peer- and self-evaluation and a number of other procedures.
Objective vs. Subjective assessment
Objective assessment (or, more usually, testing) is characterised by tasks in which there is only one right answer.  It may be a multiple-choice test, a True/False test or any other kind of test where the result can readily be seen and is not subject to the marker's judgement.
Subjective tests are those in which questions are open ended and the marker's judgement is important.
Of course, there are various levels of test on the subjective-objective scale.
Criterion-referencing vs. Norm-referencing in tests
Criterion-referenced tests are those in which the result is measured against a scale (e.g., by grades from A to E or by a score out of 100).  The object is to judge how well someone did against a set of objective criteria independently of any other factors.  A good example is a driving test.
Norm-referencing is a way of measuring students against each other.  For example, if 10% of a class are going to enter the next class up, a norm-referenced test will not judge how well they achieved a task in the test but how well they did against the other students in the group.  Some universities apply norm-referenced tests to select undergraduates.
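If it helps to see the difference in numbers, here is a minimal sketch in Python.  The scores, the pass mark of 60 and the 10% quota are all invented for illustration:

```python
# Illustrative only: invented scores, pass mark and promotion quota.
scores = {"Ana": 72, "Ben": 58, "Chen": 91, "Dara": 66, "Eva": 49,
          "Filip": 83, "Gul": 61, "Hana": 77, "Ivan": 55, "Jo": 68}

# Criterion-referenced: everyone who reaches the fixed standard passes,
# regardless of how the rest of the group performed (like a driving test).
PASS_MARK = 60
criterion_pass = [name for name, score in scores.items() if score >= PASS_MARK]

# Norm-referenced: only the top 10% go up, however high or low the
# scores are in absolute terms.
quota = max(1, round(len(scores) * 0.10))
norm_selected = sorted(scores, key=scores.get, reverse=True)[:quota]

print("Criterion-referenced pass:", criterion_pass)   # seven of the ten pass
print("Norm-referenced selection:", norm_selected)    # only the single top scorer
```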

There's a matching exercise to help you see if you have understood this section.  Click here to do it.



Testing – what makes a good test?

You teach a child to read, and he or her will be able to pass a literacy test.
George W. Bush

The first thing to get clear is the distinction between testing and examining.
Complete the gaps in the following in your head and then click on the table to see the answers.
[table: test vs. exam gap-fill task]

One more term (sorry):
The term 'backwash' or, sometimes, 'washback', is used to describe the effect on teaching that knowledge of the format of a test or examination has.  For example, if we are preparing people for a particular style of examination, some (perhaps nearly all) of the teaching will be focused on training learners to perform well in that test format.



Types of tests

There are lots of these but the major categories are

aptitude tests
These test a learner's general ability to learn a language rather than the ability to use a particular language.  Example: the Modern Language Aptitude Test (US Army) and its successors.
achievement tests
These measure students' performance at the end of a period of study to evaluate the effectiveness of the programme.  Example: an end-of-course or end-of-week etc. test (even a mid-lesson test).
diagnostic tests
These discover learners' strengths and weaknesses for planning purposes.  Example: a test set early in a programme to plan the syllabus.
proficiency tests
These test a learner's ability in the language regardless of any course they may have taken.  Example: public examinations such as FCE etc. but also placement tests.

As far as day-to-day classroom use is concerned, teachers are mostly involved in writing and administering achievement tests as a way of telling them and the learners how successfully what has been taught has been learned.



Types of test items

Here, again, are some definitions of the terminology you need in order to think or write about testing.

alternate response
This sort of item is probably most familiar to language teachers as a True / False test.  (Technically, only two possibilities are allowed.  If you have a True / False / Don't know test, then it's really a multiple-choice test.)
multiple-choice
This is sometimes called a fixed-response test.  Typically, the correct answer must be chosen from three or four alternatives.  The 'wrong' items are called the distractors.
structured response
In tests of this sort, the subject is given a structure in which to form the answer.  Sentence completion items of the sort which require the subject to expand a sentence such as He / come/ my house / yesterday / 9 o'clock into He came to my house at 9 o'clock yesterday are tests of this sort.
free response
In these tests, no guidance is given other than the rubric and the subjects are free to write or say what they like.  A hybrid form of this and a structured response item is one where the subject is given a list of things to include in the response.


Ways of testing and marking

Just as there are ways to design test items and purposes for testing (see above), there are ways to test in general.  Here are the most important ones.

direct testing
Testing a particular skill by getting the student to perform that skill.  Example: testing whether someone can write a discursive essay by asking them to write one.
indirect testing
Trying to test the abilities which underlie the skills we are interested in.  Example: testing whether someone can write a discursive essay by testing their ability to use contrastive markers, modality, hedging etc.
discrete-point testing
A test format with many items requiring short answers, each targeting a defined area.  Example: placement tests are usually of this sort, with multiple-choice items focused on vocabulary, grammar, functional language etc.
integrative testing
Combining many language elements to do the task.  Example: public examinations contain a good deal of this sort of testing, with marks awarded for various elements: accuracy, range, communicative success etc.
subjective marking
The marks awarded depend on someone's opinion or judgement.  Example: marking an essay on the basis of how well you think it achieved the task.
objective marking
Marking where only one answer is possible – right or wrong.  Example: machine-marking a multiple-choice test completed by filling in a machine-readable mark sheet.
analytic marking
The separate marking of the constituent parts that make up the overall performance.  Example: breaking down a task into parts and marking each bit separately (see integrative testing, above).
holistic marking
Different activities are included in the overall description to produce a multi-activity scale.  Example: marking an essay on the basis of how well it achieves its aims (see subjective marking, above).
Naturally, these types of testing and marking can be combined in any assessment procedure and often are.
For example, a piece of writing in answer to a structured response test item can be marked by awarding points for mentioning each required element (objective) and then given more points for overall effect on the reader (subjective).
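As a concrete sketch of that combination, here is a minimal example in Python.  The required content points, the weightings and the candidate's answer are all invented for illustration:

```python
# Illustrative only: invented content points, weightings and candidate answer.
required_points = ["apologise", "give a reason", "suggest a new date"]

def objective_content_score(points_covered):
    """One mark per required content point mentioned: right or wrong, no judgement."""
    return sum(1 for point in required_points if point in points_covered)

def subjective_impression_score(marker_band):
    """A band from 0 to 5 which the marker awards for overall effect on the reader."""
    return marker_band

# A candidate covered two of the three required points and the marker
# judged the overall effect on the reader to be worth 4 out of 5.
covered = ["apologise", "suggest a new date"]
total = objective_content_score(covered) + subjective_impression_score(4)
print(total, "out of", len(required_points) + 5)   # 6 out of 8
```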



Three fundamental concepts:
reliability, validity and practicality

  1. Reliability
    This refers, oddly, to how reliable the test is.  It answers this question: would a candidate get the same result whether they took the test in London or Kuala Lumpur, or whether they took it on Monday or Tuesday?  This is sometimes referred to as test-retest reliability.  A reliable test is one which will produce the same result if it is administered again.  Statisticians reading this will immediately understand that it is the correlation between the two sets of results that measures reliability (there's a short sketch of this after the list below).
  2. Validity
    Two questions here:
    1. does the test measure what we say it measures?
      For example, if we set out to test someone's ability to participate in informal spoken transactions, do the test items we use actually test that ability or something else?
    2. does the test contain a relevant and representative sample of what it is testing?
      For example, if we are testing someone's ability to write a formal email, are we getting them to deploy the sorts of language they actually need to do that?
  3. Practicality
    Is the test deliverable in practice?  Does it take hours to do and hours to mark or is it quite reasonable in this regard?
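To make the point about correlation in (1) concrete, here is a minimal sketch.  The two sets of scores are invented; the calculation is an ordinary Pearson correlation, so a figure close to 1 suggests the test ranks candidates consistently across the two sittings:

```python
from statistics import correlation  # available from Python 3.10

# Illustrative only: invented scores for the same ten candidates
# taking the same test on Monday and again on Tuesday.
monday  = [55, 62, 71, 48, 80, 66, 59, 74, 52, 68]
tuesday = [57, 60, 73, 50, 78, 65, 61, 72, 55, 70]

# Test-retest reliability: the closer this figure is to 1.0, the more
# consistently the test behaves from one administration to the next.
print(round(correlation(monday, tuesday), 2))
```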

For examining bodies, the most important criteria are practicality and reliability.  They want their examinations to be trustworthy and easy (and cheap) to administer and mark.
For classroom test makers, the overriding criterion is validity.  We want a test to test what we think it tests and we aren't interested in getting people to do it twice or in making it (very) easy to mark.

There's a matching test to help you see if you have understood this section.  Click here to do it.

So:

  1. How can we make a test reliable?
  2. How can we make a test valid?


Reliability

If you have been asked to write a placement test or an end-of-course test that will be used again and again, you need to consider reliability very carefully.  There's no use having, for example, an end-of-course test which produces wildly different results every time you administer it, and if a placement test did that, most of your learners would end up in the wrong class.
To make a test more reliable, we need to consider two things:

  1. Make the candidates’ performance as consistent as possible.
  2. Make the scoring as consistent as possible.

How would you do this?  Think for a minute and then click here.



Validity

If you are writing a test for your own class, or for an individual learner or group of students for whom you want to plan a course or see how a course is going, then validity is most important for you.  You will only be running the test once and it isn't important that the results correlate with other tests.  All you want to ensure is that the test is testing what you think it's testing so that the results will be meaningful.
There are five different sorts of validity to consider: face, content, predictive, concurrent and construct validity.

To explain each:

Face validity
Students won't perform at their best in a test which they don't believe is properly assessing what they can do.  For example, a quick chat in a corridor may tell you lots about a learner's communicative ability but the learner won't feel he/she has been fairly assessed (or assessed at all).
Content validity
If you are planning a course to prepare students for a particular examination, for example, you want your test to represent the sorts of things you need to teach to help them succeed.
Predictive validity
Equally, your test should tell you how well your learners will perform in the tasks you set and the lessons you design to help them prepare for the examination.
Concurrent validity
This may be less important to you but if your test results correlate well with learners' performance in the examination proper, the test tells you more than if they don't.
Construct validity
You need to be clear in your own head about exactly what skills and abilities are needed to follow the course and how you are testing them.  If you can't closely describe what you are testing, the test won't work properly.

Finally, having considered all this, you need to construct your test.  How would you go about that?
Think for a moment and make a few notes and then click here.

Easy.



Related guides  
assessing Listening Skills: these guides assume an understanding of the principles and focus on skills testing
assessing Reading Skills
assessing Speaking Skills
assessing Writing Skills
placement testing: this is a guide in the Academic Management section concerned with how to place learners in appropriate groups; it contains a link to an example 100-item placement test


Of course, there's a test on all of this: some informal, summative evaluation for you.


If you are preparing for Delta Module Three, there's a guide to how to apply all this.


References:
Alderson, J. C & Hughes, A (Eds.), 1981, Issues in Language Testing, ELT Documents 111, London: The British Council, available from http://wp.lancs.ac.uk/ltrg/files/2014/05/ILT1981_CommunicativeLanguageTesting.pdf [accessed October 2014]
Corder, S. P, 1973, Introducing Applied Linguistics, London: Penguin
Hughes, A, 1989, Testing for Language Teachers, Cambridge: Cambridge University Press
Oxford Dictionaries, http://www.oxforddictionaries.com/
Scriven, M, 1991, Evaluation Thesaurus, 4th edition, Newbury Park, CA: Sage Publications
General references for testing and assessment.  You may find some of the following useful.  The text (above) by Hughes is particularly clear and accessible:
Alderson, J. C, 2000, Assessing Reading, Cambridge: Cambridge University Press
Carr, N, 2011, Designing and Analyzing Language Tests: A Hands-on Introduction to Language Testing Theory and Practice, Oxford: Oxford University Press
Douglas, D, 2000, Assessing Languages for Specific Purposes, Cambridge: Cambridge University Press
Fulcher, G, 2010, Practical Language Testing, London: Hodder Education
Harris, M & McCann, P, 1994, Assessment, London: Macmillan Heinemann
Heaton, JB, 1990, Classroom Testing, Harlow: Longman
Martyniuk, W, 2010, Aligning Tests with the CEFR, Cambridge: Cambridge University Press
McNamara, T, 2000, Language Testing, Oxford: Oxford University Press

Rea-Dickins, P & Germaine, K, 1992, Evaluation, Oxford: Oxford University Press
Underhill, N, 1987, Testing Spoken Language: A Handbook of Oral Testing Techniques, Cambridge: Cambridge University Press