OFSTED Inspector training Autumn 2018 - Assessment reflections

JUSCO was recently invited to the OFSTED inspector training session in Manchester. This post is about the session devoted to assessment. The session had three aims:

To ensure inspectors understand how they can evaluate the purpose of assessment.
To highlight the inferences that inspectors can draw from assessment data provided by schools.
To provide further support to inspectors in reaching valid and well-evidenced judgements on pupils’ progress.

It is worth noting that a pre-reading pack was shared before the session. It listed Professor Daniel Koretz’s book “Measuring Up: What Educational Testing Really Tells Us” and highlighted chapter two, “What is a test?”, for inspectors to read.

What is a test - chapter 2 in a nutshell:

If you get the chance, do read Daniel’s book as it really is worth it and I fully accept that I don’t do it justice here.

Daniel starts the chapter by discussing a 2004 Zogby poll of voting intentions in the race for the White House between George W Bush and John Kerry. He notes that the 1018 likely voters surveyed indicated a 4 percentage point lead for George W Bush and that this was reasonably accurate, as the final result gave Bush a margin of 2.5 percentage points.

He goes on to point out that it wouldn’t be possible to measure the voting intentions of 121,000,000 people, so pollsters instead use the sample as a proxy measure to make a prediction. The ability to do this depends on the design of the sample, the wording of the question and the willingness and ability of respondents to provide answers.
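To get a feel for why 1018 respondents can stand in for so many voters, here is a minimal sketch of the standard margin-of-error calculation for a proportion. It assumes a simple random sample, a 95% confidence level and the worst case of an evenly split electorate; real polls such as Zogby's use weighting and other adjustments, so this is only an illustration and the function name is my own.

```python
import math

def margin_of_error(sample_size, proportion=0.5, z=1.96):
    """Approximate margin of error for a proportion from a simple random sample.

    Assumptions (not from the book): simple random sampling, z=1.96 for
    roughly 95% confidence, and proportion=0.5 as the worst case.
    """
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)

moe = margin_of_error(1018)
print(f"Margin of error for n=1018: +/-{moe * 100:.1f} percentage points")
```

Under these assumptions the figure comes out at roughly plus or minus 3 percentage points on each candidate's share. Notice that the size of the electorate does not appear in the formula at all: for a large population it is the size and design of the sample that matter.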

Koretz points out that educational achievement tests are much the same: they are proxies for a better and more comprehensive measure that it would not be possible to obtain in a test (without the test lasting many days and having a very great number of questions!).
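The idea of a test as a sample can be sketched with a small simulation. Everything here is hypothetical: an imaginary pupil who has mastered 70% of a 1000-item domain sits several 20-question papers, each drawn at random from that domain. It is a rough illustration of the sampling principle, not a model of any real test.

```python
import random

def simulate_test_scores(domain_size=1000, true_mastery=0.7,
                         questions=20, sittings=5, seed=1):
    """Score one hypothetical pupil on several randomly sampled papers."""
    rng = random.Random(seed)
    known_count = round(domain_size * true_mastery)
    # 1 = the pupil knows this item, 0 = the pupil does not.
    domain = [1] * known_count + [0] * (domain_size - known_count)
    scores = []
    for _ in range(sittings):
        paper = rng.sample(domain, questions)  # each paper samples the domain
        scores.append(sum(paper) / questions)
    return scores

scores = simulate_test_scores()
print(scores)
```

The pupil's underlying mastery never changes, yet the score moves from paper to paper depending on which items happened to be sampled. That is one reason a single score is an estimate of what a pupil knows rather than a direct reading of it.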

A test can only sample from the broad range of knowledge a student has of a particular subject (the domain), so the accuracy of a test will depend on the careful sampling of content and skills. It will also depend on the wording of the questions. In the OFSTED training an example similar to the one below was given:

What inference would you draw from data that revealed that 85% of the class answered this question incorrectly?

Bob has six sweets. Bob perambulates towards the nearest purveyor of fine confectionery and proceeds to purchase another half dozen sugary delicacies. After debating whether to consume them himself, he decides, generously, to apportion his sweets equally between himself and his playmate, Alice. How many sweets is Bob left with?

From the above example it is clear that mastery of the mathematical principles is unlikely to be the barrier that prevented 85% of a class answering this question correctly!

Koretz points out that in addition to the wording we also need to consider the motivation of the test takers to do well and the behaviour of others, in particular teachers. He notes that “If there are problems with any of these aspects of testing, the results for the small sample of behaviour that constitute the test will provide misleading estimates of a student’s mastery of the larger domain” (D. Koretz - Measuring Up - p. 21). He goes on to say that “A failure to grasp this principle is at the root of widespread misunderstandings of test scores” and that “it has also resulted in uncountable instances of bad test preparation by teachers and others, in which instruction is focused on the small sample actually tested rather than the broader set of skills the mastery of which the test is supposed to signal” (D. Koretz - Measuring Up - p. 22).

It was interesting to see the following quote appear in the conference notebook:

“Individual students’ scores on individual tests may be thought of as ‘...an inadequate report of an inaccurate judgement by a biased and variable judge of the extent to which a student has attained an undefined level of mastery of an unknown proportion of an indefinite amount of material’” Paul Dressel, 1957

The message was that test data should be treated with caution or, as Koretz puts it, “... don’t treat ‘her score on the test’ as a synonym for ‘what she has learned’” (D. Koretz - Measuring Up - p. 10). This was perhaps illustrated with an activity similar to the one below:

In a meeting with school leaders, you are presented with a vast amount of in-school assessment information. It shows that pupils are doing much better than published data. You are not familiar with the methodology and have no way of knowing whether it is valid.

What are the potential issues with such data?
What would you do next?

The point of the task was to encourage inspectors to ask questions beyond the data presented. At the table where I was sitting, suggestions for finding out more included book scrutiny, pupil conversations and asking what, in leaders’ opinion, had happened that would account for the improvement.

Glossary as provided to inspectors Autumn 2018

Components: Are the building blocks that have been identified as the most useful for subsequent learning.

Composite: Is the complex activity / skill / performance the components will combine to achieve.

Construct underrepresentation: occurs when a test fails to test all of the intended construct. A Year 9 end of year English exam may only include some comprehension questions and a creative writing task, whereas the ‘big idea’ about which inferences will be drawn is ‘English’. With other aspects of ‘English’ such as speaking and listening, writing non-fiction and so on left out, the construct ‘English’ is underrepresented. In reality we would be unable to create a single test that assesses ‘English’ without construct underrepresentation - it would be too big and cumbersome. So, we have to be precise about what aspects of the big idea are being assessed, and then ensure that, over time, we sample from these areas in order to build up as reliable a picture as possible of what a child knows, can do and understands.

Construct-irrelevant variance: sometimes occurs because tests are just too big and try to do too many things. You end up with a partial assessment of one thing and a partial assessment of other things, leading to inferences that become very confused and, ultimately, meaningless.

Reliability: consistency from one measurement to the next. For example, bathroom scales are reliable if the measurement of an object’s weight is consistent. Of course, the scales might be ‘reliably wrong’ if they are badly calibrated. If findings from an assessment are replicated consistently we can say they are reliable.

Validity: the quality of inferences drawn from a test. A particular assessment might provide good support for one inference but weak support for another. Often inferences drawn from assessments have low validity because the test is not testing what the user thinks it is testing.

Transfer: The application of knowledge. A distinction can be made between ‘near transfer’ and ‘far transfer’. Transferring information to a very similar context, such as a question that asks about the same material in a new way, is ‘near transfer’. Applying information to a novel problem would be ‘far transfer’. Transfer is difficult to achieve but is the main goal of education.

Inference: a conclusion reached on the basis of evidence and reasoning.

Domain: Tests are about making a measurement, and generally, tests are trying to measure something huge. The technical term for what we are trying to measure is the domain. The domain that the tests are trying to measure is the extent of a pupil’s knowledge and skills and their ability to apply these in the real world in the future.
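
The bathroom scales example in the Reliability entry can be made concrete with a short sketch. All the numbers are made up for illustration: a hypothetical set of scales that reads 2 kg too heavy, with only a tiny amount of random wobble between readings.

```python
import random
import statistics

def weigh(true_weight, bias, noise_sd, readings=10, seed=0):
    """Simulate repeated readings from scales with a fixed calibration error."""
    rng = random.Random(seed)
    return [true_weight + bias + rng.gauss(0, noise_sd) for _ in range(readings)]

# Hypothetical values: a 70 kg object on scales that over-read by 2 kg.
readings = weigh(true_weight=70.0, bias=2.0, noise_sd=0.05)
spread = statistics.stdev(readings)               # small spread = consistent
average_error = statistics.mean(readings) - 70.0  # large error = badly calibrated

print(f"Spread between readings: {spread:.2f} kg")
print(f"Average error: {average_error:+.2f} kg")
```

The readings agree closely with one another, so the scales are reliable, yet every reading is about 2 kg out. Consistency alone says nothing about whether the inference drawn from the measurement is sound, which is the separate question the Validity entry addresses.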