“I would like to take this course because the subject is highly interesting to me, but I am worried about the risk. Suppose I scored 100% in some component, is there any chance my marks could be adjusted down just because other people have performed well? I have worked really hard in my academic career to build up a WAM [weighted assessment mark, see also grade point average or GPA] of XX and I’d rather take another course if this has any chance of dropping.” — Paraphrased from numerous anonymous later-year students asking about a first year course.
Of all the years to start a new admin role, 2020 was a rather interesting one. Notwithstanding all the coronavirus uncertainty over international student arrivals early on, the urgent decamp from campus to fully-online teaching in the space of weeks around late March, and the rapid adaptation to change in everything from lab classes to final exams thereafter, probably the biggest eye-opener for me this entire year has been the looming train-wreck that our recent obsession with the minutiae of metrics in the higher education sector holds for the future of student learning.
A favourite story from my undergraduate years is my first tutorial for Higher Complex Analysis in second year. It started with the tutor stating point-blank: “Welcome to higher complex analysis. This is a hard course. The typical fail rate is 50%. Look at the person next to you… One of you is going to fail.” To my right was Mr Sydney Grammar with his ATAR (Australian Tertiary Admission Rank, then called the TER) of 99.95, and to my left was Miss Elite Private School with a TER not far off. Half the room was the cream of Sydney’s elite private and selective public schools, and I was not from one of them. As Hunter would say…
A modern student (and some follow my twitter, so may read this) would probably ask why on earth I would risk a scraping pass or a credit when I could just walk an HD in the ordinary level course. What about my WAM? The answer is that no one cared about WAM back then; it didn’t even exist, at least not that we students could see. What’s more, we barely even cared about marks, at least in the way a modern student does. An HD was always nice, and a credit was a sign that you could have done better. Besides, the marks arrived by snail mail in the middle of the long university breaks, when our minds were on better things.
Back to the question of why though? Because I wanted to push myself and see if I could hack the pace in higher maths courses. That was the culture: if you took yourself seriously as a student, you set out to take courses because they a) seriously challenged you and/or b) were interesting. Doing easy courses was just boring and ‘being soft’. And in the courses that didn’t meet a) or b), the mantra was often something along the lines of “50 is a pass, 51 is a waste of effort that could have been spent elsewhere”. There were courses that had legendary status in terms of difficulty, and they attracted serious interest. Taking a shot was respected, and coming out with a lot less than an HD was totally fine, and definitely not cause for major anxiety.
I think about what would happen if the first tutorial for higher complex in 2020 started like it did in 1994. Some would probably drop the course, but that likely happened in 1994 too. I’m sure the tutor was ‘putting the fear’ into us so we wouldn’t underestimate the course or lack solid commitment (Garth Gaudry was the lecturer, really nice guy but he absolutely did not ‘pull punches’, it was fierce). Nowadays, there would probably also be a stack of emails like the one at the top to deal with, questioning how exactly the marking will work to protect their WAM, possibly even a stack of formal complaints about the tutor as well for intimidation, causing anxiety, or suggesting unfair marking practices.
“Their mask reflects what you seek and that is what makes it so nice at first. A manufactured mirror of your dreams” — Tracy Malone
But why are we not surprised? When we ourselves fill our funding proposals with meaningless journal impact factors and obsess/boast about our h-indices. When in a year full of important crises to solve, key amongst them the ever declining funds available for teaching and research, some see high importance in coming up with yet another global university ranking scheme to measure by and boast about. When our students receive an email soon after their exams with their marks, and a key line is their term and overall WAMs… to several decimal places! And it’s not just my own campus — every campus in Australia and many around the world are no different. Look at any modern graduate CV and more likely than not there is a WAM or GPA or equivalent, prominently placed, given to decimal places. With all of this, we all soon become like an academic version of Narcissus, staring into our pool, constantly all-absorbed by our own key performance indicator, letting it dominate our whole existence.
I find it personally amusing, particularly for science and engineering students, because when you come into a first year physics course, what is the first thing that you learn? Uncertainty and significant figures. You start by asking how accurately you can measure things, and then how to present your results so that your level of certainty is clear. For example, if a quantity is 110 plus or minus 15, then writing 109.4857843 with no stated uncertainty is a glaring misrepresentation of your level of trust in that number. All those decimal places are meaningless garbage when the uncertainty is at the +/- 15 level, and we dock students’ marks accordingly in their assessments for poor use of significant figures.
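The rule we teach those first years is mechanical enough to write down. Here is a minimal sketch in Python (the function name and the keep-to-the-uncertainty's-leading-digit convention are my own illustrative choices; sig-fig conventions vary between courses):

```python
import math

def round_to_uncertainty(value: float, uncertainty: float) -> float:
    """Round `value` to the decimal place of the uncertainty's leading digit."""
    # e.g. an uncertainty of 15 -> keep to the tens place; 0.15 -> the tenths place
    place = -int(math.floor(math.log10(uncertainty)))
    return round(value, place)

# The example from the text: 109.4857843 with an uncertainty of +/- 15
print(round_to_uncertainty(109.4857843, 15))    # → 110.0
print(round_to_uncertainty(109.4857843, 0.15))  # → 109.5
```

Nothing deep is going on: the digits beyond the uncertainty's own leading digit carry no information, so they get dropped.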
Meanwhile, here we are, presenting grade point averages and weighted assessment marks to several decimal places. The obvious question: what is our real uncertainty on such a number? From a statistical perspective alone, it’s at the very least the standard deviation of the numbers making up that average. Yet how many students have grades so uniform that the standard deviation is less than 1? The decimal places are meaningless from the outset, and we haven’t even got down to the uncertainties in the underlying grades yet. Amusingly, I can go and look up my own transcript, which now includes a WAM (the printed transcript with my testamur from 1997 has no such number). My WAM is a touch over 85, and the standard deviation… 8.7. Yet there is my WAM in the system as 85.733, presented as though that 3-in-1000 is a truly meaningful digit that an employer can bank on. Am I truly better than someone with 85.732 and worse than someone with 85.734?
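As a toy illustration (the 30 marks below are invented for the example, not anyone's real transcript), here is the transcript-style presentation next to the one any physics lab demonstrator would accept:

```python
import statistics

# Hypothetical transcript: 30 made-up course marks
marks = [88, 92, 76, 85, 95, 81, 74, 90, 87, 79,
         93, 84, 70, 96, 82, 88, 77, 91, 86, 80,
         94, 73, 89, 85, 78, 92, 83, 87, 75, 90]

wam = statistics.mean(marks)
spread = statistics.stdev(marks)

# The transcript-style number, to three decimal places...
print(f"WAM: {wam:.3f}")                        # WAM: 84.667
# ...versus a presentation honest about the spread in the underlying marks
print(f"WAM: {round(wam)} +/- {round(spread)}")  # WAM: 85 +/- 7
```

With a spread of around 7 marks, even the units digit of that average is wobbly; the three decimal places are pure theatre.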
And at least locally, the problem is exacerbated by the culture set in high school, where the fixation is on another single numerical score — the Australian Tertiary Admission Rank (ATAR). However, the two are on very different levels as quantities in terms of uncertainty and statistical significance. The ATAR is a number between 0 and 100, given to 2 decimal places, that reflects a ranking rather than a statistical average mark. For example, an ATAR of 95 means you are in the top 5% of your age group (based on age, not on year level). The statistical set underlying the ranking has approximately 55,000 students in it, and there is due consideration of the course set taken, relative difficulty, etc., with a rigorous and reported process. It may well be meaningful at its smallest increment (0.05), and it certainly is at the integer level.
The easy mistake to make is to assume a WAM or GPA is similar. But a WAM is not a ranking, nor is it statistically underpinned by all the courses of 55,000 students. It is little more than the raw average of maybe 30-odd scores between 0 (ideally 50) and 100, often with no weighting for course difficulty or year level, and with an inherent lack of statistical significance or certainty, i.e., the standard deviation is rarely small. To treat it as being as meaningful as the ATAR would be a huge mistake, yet, for the most part, that’s exactly what most people do.
The disturbing thing is that we as a sector present WAM/GPA on official documents as though it has any real meaning at all — we pull the same bullshit charade that we do with h-index, impact factor and university rankings — with our students as the victims this time rather than ourselves. Given a number so prominent and so delightfully trivial to discriminate on, it would be utterly miraculous if employers didn’t do exactly that, en masse. It may not be the thing that decides between two final candidates (I’d hope not; I’d seriously disrespect as utterly negligent any employer who let it), but it almost certainly decides the first cut, with nary a deeper look at the strengths and weaknesses that make up the standard deviation in the marks. With WAM determining your foot in the door to the later stages of recruitment processes, it’s entirely clear why students live in absolute terror of this number and let its whims influence their every decision. This is particularly so given that nowhere have I ever seen a clear statement on transcripts and other such documents about the uncertainty in such a number, or a warning about the extreme care that should be exercised in its interpretation.
“That was one of the problems with the Narcissus figure. Here is a face looking at a face, and the problem is the image of the thing is never actually the thing. You try and grab it and it’s not there. It’s water. It disappears.” — Jane Alison
There’s a destructive assumption inherently built into WAM, which is that it somehow accurately measures student merit. This is a furphy for a number of reasons. First and foremost is the Matthew effect: students from better socio-economic backgrounds, with access to elite private schools, are better prepared to come into university, hit the ground running, and score highly in their first 1-2 years of studies. Many students go to schools where they were lucky to be adequately prepared for the Higher School Certificate (I taught myself both my HSC chemistry electives, plus components of other courses, to optimise my chances), let alone prepared in advance to excel at university. For many students, that first year or two of adjusting just entrenches the privilege disparities of high school into their WAM and their chances after university, effectively turning meritocracy into stealth-cover for aristocracy. And that’s before we look at other factors influencing the ability to excel at university, such as having the luxury of parents who can support you to work full time on your studies versus needing to carry a job to pay the rent and eat.
Second is the lack of any real difficulty weighting. For example, whether you get a 95 in complex analysis or a 73 in higher complex analysis, those 6 units of credit count the same in many WAM or GPA schemes. Yet who would you consider the better student — the one who took a shot at the hard subject and did well, or the one who didn’t push themselves at all and banked the super-high grade? Here we see our first perverse incentive: it discourages students from taking harder subjects in the quest to fully extend their knowledge. It is like an Olympic diving contest where every dive has a difficulty score of 1 — why would you risk a front four-and-a-half when you can just execute a perfect pin-drop?
The third follows from the second. Many of the WAM/GPA schemes I’ve seen don’t account for the year level of the course either. This opens up an enormous perverse incentive known as ‘WAM-gaming’: you invest your electives in subjects at low levels (1st year), focusing on those with a reputation for easier assessment or higher grades, since they numerically improve your WAM. This incentivises behaviour like we see in the quote at the start of this post — students taking courses solely for their impact on a numerical score. Is this what we want to see in higher education? I’d argue it isn’t, yet that’s exactly what we achieve.
Fourth, there’s no real accounting for the type of assessment either. A course could be almost all small pieces of assessment, such as labs or online quizzes; it could be dominated by a whole-of-term major work, like a portfolio; or it could be almost entirely a tail-loaded pressure assessment, for example, a final exam. These assessment types carry different levels of difficulty and different scope for invigilation (i.e., for preventing cheating). And within some assessment types there are also issues with measuring aptitude for the assessment type rather than real ability, as well as poor assessment design.
Regarding aptitude versus ability, the classic example is heavily weighted final exams. There’s already been much written on this, see here or here, but a big issue is that they often tend to test a student’s aptitude for doing exams more than they test actual ability in the taught material. Like many students, I learned to become good at exams, to put the pressure aside and do all the things that enabled me to snaffle up marks like a dog chasing treats. I’m not convinced they tested my knowledge and understanding that much, even the well designed ones (I’ll get to the poorly designed ones in a moment). They were almost entirely about short-term cramming of huge volumes of algebra, and I was good at the mental tricks required for that. I’ve also seen enough cases of the inverse — perfectly good students who clearly understand the material, perhaps even better than I did and still do, but who seem to collapse in a heap under exam pressure and massively under-perform. And the most ridiculous part of the whole lot — a real workplace is nothing like a final exam, so what is the actual point? My job, even as a professional physicist, is not dumping algebra onto a page from my brain under time pressure. Ultimately, we use a contrived and pointless exercise to generate a meaningless number with high uncertainty that then determines career prospects. It’s literally insane.
And that’s before I even get to the highly variable quality of assessment in higher education. How poor the assessment design can get is best illustrated by another example from my undergrad student days. A lecturer, who I won’t name, gave a horrifically hard exam the year before us. It was carnage: on raw marks for just that exam, the whole class was sure it had failed spectacularly. We spent weeks with the previous year’s paper, teasing out reasonable solutions to all the problems — putting students under some time pressure is one thing, but giving them a 3-week assignment to solve in 2 hours is another. Anyway, said lecturer made the amateur error of not realising we had access to the past paper and gave the same exam two years in a row. Needless to say, we were prepared — we knocked the damn thing out of the park. The lecturer was stunned and told us all he suspected we had cheated, as we couldn’t all be that smart. No shit, Sherlock. But the point is, unlike, e.g., the ATAR, there is sometimes very little accounting for continuity of assessment quality between years or across cohorts in a single institution, let alone between one university and another. There isn’t even standardisation of curriculum and assessment at the national level in any real terms comparable to the ATAR. All of these WAMs and GPAs measure different things by different rules in sub-statistical ways, even within a single university, let alone between them.
I wrote earlier about the uncertainty arising from the statistics of averaging being at the 5-10 point level; that’s before we even account for uncertainty in the underlying 100-point grades for each course. Those too are potentially no better than +/- 5-10 points in real terms at best, perhaps worse given an exam can catch a good day or a bad day entirely at random across the cohort. By the time you get to the bottom of the four effects above, you’re probably looking at a real uncertainty up at the +/- 25 level, and at that point, you really can’t say much more than not competent, competent or very competent. Ultimately, a measure like WAM or GPA is about as predictive of performance in any job you’d like to choose, including being an academic, as height is of performance in basketball. There is some correlation, but if you want to make accurate predictions from the integers or decimal places, you are demonstrating nothing but your own foolishness…
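To put a number on just one of these contributions, here is a small Monte Carlo sketch in Python. The inputs are assumptions for illustration only: 30 courses in a degree, and a ±7-mark good-day/bad-day noise on each course mark. It asks how much that random noise alone moves the final average:

```python
import math
import random

random.seed(42)

N_COURSES = 30   # assumed number of courses in a degree
NOISE = 7.0      # assumed per-course "good day / bad day" spread, in marks
TRIALS = 10_000

# A hypothetical student whose true ability corresponds to a constant mark
true_mark = 85.0

wams = []
for _ in range(TRIALS):
    # Each course mark = true ability + random exam-day noise
    marks = [true_mark + random.gauss(0, NOISE) for _ in range(N_COURSES)]
    wams.append(sum(marks) / N_COURSES)

mean_wam = sum(wams) / TRIALS
spread = math.sqrt(sum((w - mean_wam) ** 2 for w in wams) / (TRIALS - 1))

print(f"WAM wanders by about +/- {spread:.2f} marks")
print(f"Theoretical standard error: {NOISE / math.sqrt(N_COURSES):.2f}")  # 7/sqrt(30) ≈ 1.28
```

Even this single, most benign noise source moves the WAM by more than a full mark — roughly a thousand times the size of the third decimal place that transcripts report — and the systematic effects above only widen it from there.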
“There is no innovation and creativity without failure. Period.” — Brené Brown
Yet, despite all this, WAM or GPA has come to be the dominant factor motivating student decisions in many universities in Australia and internationally. It makes students into risk-averse, highly-stressed individuals focused solely on chasing a number, much like the academics who teach them, endlessly chasing citations and h-indices, and the organisational leaders entirely beholden to whatever it takes to jump one spot in some ranking table or other. From bottom to top, our obsession with the minutiae of single metrics has made us totally lose the plot over the last two decades, and in particular, totally lose sight of our actual mission — producing a next generation of young professionals who are not just highly trained but also well-adjusted individuals ready to contribute to society.
We profess to value innovation, creativity, leadership, resilience, and in particular, an ability to think critically in a world awash with knowledge and information. We expect our students to think about uncertainty, statistical significance and the search for true meaning in noise. And then what do we do? We construct them a totally bullshit metric that we know should earn a fail in any proper academic course, and happily turn a blind eye while everyone buys into it. We sell them the first few grips up a greasy pole of stupid sub-statistical performance measures that are about as real, as true measures of achievement, as pixies and bats in the desert.
So what is the solution?
I am becoming a huge fan of abolishing WAM/GPA and numerical grading on all courses entirely. Throw the whole lot in the bin: it’s largely meaningless, detrimental to education, and, as we’re seeing now in the 2020s, driving a massive online industry of contract cheating and proctoring wars that I think can only lead the sector down a path to its own destruction. Contract cheating is our own stupidity being weaponised for private profit, just as our obsession with impact factor and ‘prestige’ journals has been similarly weaponised.
A three-point grading system for each course should be more than sufficient, something like not competent, competent and competent with merit*. The first one is only permanently recorded if it’s not ultimately substituted by one of the other two grades. The third one gives the students something to stretch themselves for. This is more than enough to give any employer an initial triage on competence and ability and enable proper management of degree progression. We can still have marks for individual assessments, it’s a good thing for students to have a measure for how they are performing. But there’s no need for those to be tallied and documented on a transcript.
A breakdown into three-point grades gives students room to push themselves and be adventurous with their learning. It gives them the space to sometimes even fail or underperform on a task in a way that’s invisible, thereby enabling all the life lessons that failure brings, e.g., resilience, ability to be positively self-critical, etc. to be achieved without life-long punishment. The simplified grading system also lends itself to assessment that is more continuous across a course, producing less negative stress for the students as a result, with the ability to implement diversity of assessment type and, where required, ‘hurdle requirements’, to enable proper confirmation of competence.
Ultimately, we have to make a decision about what higher education is about and focus on doing it. Is it about developing students or is it about taking money for poorly measured credentials? I would argue our mission is the former, and if we aren’t careful, our obsession with the latter is going to get entirely in the way of our mission and render us ripe for disruption.
*An alternative would perhaps be a four-point grading system, with not competent, competent (equivalent to current PS/CR) and perhaps two bands of competent with merit, which would cover off the current DN and HD grades. This would in some senses be the minimal change that essentially is just a dropping of all insignificant numerical digits, rebadging of the grades, and proper admission of the true uncertainties involved in assessment.