Spring 1981
1. Pre-Introduction
This manuscript is a partial reproduction of a manuscript I published in Spring, 1981, slightly augmented with comments as warranted given the 20+ years since it was written. Since the original manuscript was professionally edited, any errors of grammar and/or usage are mine, induced during the current incarnation. After the CAT discussion, there is an introduction to Computer Guided Reading, a brainstorm which occurred after many years of failing to get CAT even slightly accepted. CGR is a recognition that the reading skills of our students have atrophied just as their math skills have, and that their acceptance of authority is becoming terrifying, i.e., they are ceasing to question things which should be within their intellectual grasp, and are instead accepting these things “on faith”! CGR was an attempt to correct this tendency. It failed also. Finally, returning to this material, I am posting it as a form of testament (to the failures of a career). I should add that at the time of writing, Spring 2005, the Math Department has begun implementing CAT in a monitored setting. This means that my prediction of many years ago has come to fruition naturally, of course without my help, but what can you do?
At the time of posting, my personal situation has worsened slightly, from a teaching point of view, but that's another story.
1.1 Introduction
The quality of undergraduate academic preparedness is dropping at the same time as grade inflation continues on its rampage. The lack of standard examinations for fourth year students in American schools of higher education makes this statement difficult to quantify (the Achievement Test of the College Board and the Graduate Record Examination are voluntary, not mandatory, and therefore are administered to only a fraction of our graduating seniors), but a consensus of colleagues shows that as reading ability drops, as SAT scores drop, and as the amount of material covered in key courses drops, the performance of our students drops also.
Of the indices which we possess to measure this decline, the foremost is old examinations. The examinations we gave ten+ years ago, and regarded then as quite fair, are now too hard. The knowledge that we expected students to bring to our courses then is no longer expected of them now. The belief that the teacher’s standards were reasonable has been superseded by the expectation that the class mean sets the standard for class performance.
Accountability has given rise to teacher rating systems that influence tenure, promotion, and salary decisions. Under these circumstances faculty members would be fools to insist on high quality performance based on their own subjective evaluations, and the fact is that the erosion continues.
1.2 What’s Wrong?
My particular set of biases comes from 17+ years of teaching physical chemistry on the university level. My biases about any technical course are quite specific, and it is worthwhile to set them out at the beginning, so that they may be examined separately from the question being raised about testing. Technical education is not “life preparation”, culture, or enrichment. At some point technical education in chemistry, physics, mathematics, and all the other sciences and technologies becomes the serious business of building tools for a lifetime of technical work. Regardless of the vocabulary, that is, the particular science or technology being taught, there is a common thread that winds through all technical education, and that is obsolescence. We are all obsessed by the fact that what we learned when we were students is at least partially obsolete now. Therefore, our teaching now is oriented toward the creation of tools and of mindsets. We do not expect the student to remember much, but we expect him to be able to rapidly re-equip himself with any piece of knowledge which he has seen in the past. We accept that the details are lost over time, but demand that they be re-creatable when needed. Then, when our students are out in the “real world”, they will be equipped to learn, or re-learn, whatever is needed to function in the environment where they happen to find themselves. It is in this re-creation aspect of knowing that our greatest problem presents itself.
Mathematics appears to lie at the center of this problem. Without taking excessive space to catalog the variety of errors seen on examinations, it suffices to note that many of our junior-year students cannot integrate with any distinction, cannot take partial derivatives, cannot successfully carry out error-free elementary algebra, and in general cannot translate mathematical equations into meaningful knowledge, or ideas about the physical world into meaningful mathematics. The errors that we see imply a lack of integration of mathematics into the psyche of our students, so that it remains a foreign language rather than a tool. The fundamental reason for this failure resides (IMHO) in testing and the manner in which we accredit students’ academic progress. One of the reasons that testing results in this incredible ignorance on the part of some of our students is the fragmentation it forces on the course material, breaking it into small, digestible, examinable chunks. The scheme we use to organize ourselves allows our students to learn an epsilon of information for a given examination, forget that information within minutes of the end of the exam, and relearn that same epsilon for the final. Come the end of the course, the student expurgates the material in a fit of triumph at having beaten the system once again. Years later (or days?), when the material is suddenly needed, then, and only then, does the student regret his earlier tomfoolery; but, of course, by then it is usually too late.
Given the main problem, that material is distributed in quanta, there exists a host of secondary problems associated with these quanta. First and foremost is the problem of testing on pre-passed quanta. Students regard it as universally unfair to be graded on material which has been “passed” in a previous course. I have had students argue vehemently that I should tell them the formula for the area of a circle, as they are “not responsible in this course for this knowledge”. The fact that this argument, when carried to its logical conclusion, obviates knowing anything never occurs to our students.
Second, if partial credit is assigned to partial answers, it is possible to partially pass multi-quanta questions without ever actually proving knowledge of any single piece of material.
Another reason exists for the lack of preparation that we see in our students. It is that the fraction of material answered correctly on examinations is used as a gauge of quality of performance. This makes sense at the top and bottom of the scale, but in the middle it fails to distinguish the kind of material that is being missed. No distinction between absolutely essential core material and embellishments is made in the normal test situation. As a result, the grade of C, which technically could be construed to mean that a student knew 70 percent of the tested material, actually could mean that the student knew all the material but made many silly errors; or, on the other hand, it could mean that the student didn’t know one entire section of an examination, with that section perhaps being part of the core material. It also could mean that the student functioned at about the 70th percentile in his class, with no reference to knowledge whatsoever.
Further, the time-delayed grading scheme does not help with “stupid errors”, the kind that students feel are infinitely excusable. The failure to check units, the algebraic silliness, the entire gamut of errors in areas universally conceded to be essential to proper mathematical functioning but beneath the level of the course itself: these errors detract from the purposes of the testing process and prevent both the student and the teacher from assessing progress and knowledge of the subject.
Multiple choice examinations are another source of confusion in this world view I am constructing. Never, after graduation, are there "multiple choices" in technical work (except for decision making). Rather, life poses problems in what educationalists call "constructed response" mode. We've long had the ability to use machine correction schemes for "constructed response" questions, but they have not been widely employed. Instead, multiple choice has been the choice "du jour". When Newsweek covered a proposal from Ms. Spelling concerning testing college graduates, the accompanying graphic was a pencil lying on a multiple choice form! Objective testing means, in America, multiple choice; and this is just plain wrong and bad all around. It's bad for students' intellects, it's bad for the civilization which trusts its results excessively, and it's bad in and of itself!
The main function of testing at this advanced level is not to rank students, but rather, to discourage the technically inept from continuing. It is more important to divert the technically incompetent from becoming professional “do-badders” than it is to reward the best students with A’s. For those students who will use their degrees to gain entry into a technical profession, the grades earned are not of paramount importance. The degree is the thing, it almost obliterates the record, and qualifies the owner for the possibility of a technical position in our society. Since that position might be of great importance, the idea that the possession of a degree might not imply technical competence is chilling indeed.
1.3 Some Proposed Changes in How We Examine
If you are primed to read a tirade against multiple-choice testing, you are in for a shock now. We never, never use the multiple choice format for our examinations. To my knowledge, except for the “standard” American Chemical Society examinations, which may be administered by any institution, all physical chemists examine using problems and derivations. Problems are of the type:
Compute the energy of a molecule of HD in its ground electronic state, its
second vibrational state, and its 43rd rotational state, assuming that the
molecule is not translating. Use the following atomic and molecular constants in
your computation ...
while derivations might be posed in the form:
For a Dieterici gas, obtain an expression for the critical temperature in terms
of the given constants . . . .
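To make concrete how mechanical the computation in a problem of the first type is, here is a sketch in Python. The constants below are illustrative placeholders, not vetted spectroscopic values for HD, and the rigid-rotor/harmonic-oscillator formula ignores anharmonicity and centrifugal distortion:

```python
# Rigid-rotor / harmonic-oscillator term value, in wavenumbers:
#   E/hc = we*(v + 1/2) + Be*J*(J + 1)
# The constants used below are illustrative placeholders, not vetted data.
def rovib_energy(we_cm, Be_cm, v, J):
    """Rotational-vibrational term value in cm^-1 for vibrational quantum
    number v and rotational quantum number J."""
    return we_cm * (v + 0.5) + Be_cm * J * (J + 1)

# Non-translating molecule, v = 2, J = 43, with placeholder constants:
E = rovib_energy(we_cm=3813.1, Be_cm=45.66, v=2, J=43)
print(f"E = {E:.2f} cm^-1")
```

The student's job on such a problem is exactly this: pick the right formula, substitute the given constants, and evaluate.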
An honest attempt is (usually) made to award partial credit in grading. First, for multi-part questions, errors in early stages that propagate forward into later errors are not counted multiply. Second, so-called “stupid errors” are “excused” by penalizing the student only microscopically for errors of arithmetic and perhaps algebra. It is an open question how people grade when it comes to calculus errors, but whatever is done, students who are harshly graded resent it sorely. Having passed “Calculus”, they feel that being re-graded on it is a form of double jeopardy.
The longer we continue past practices into the future, the more tangled will be the question of how to impose, or re-impose, standards. Without being nostalgic for the past, one must ask whether or not standards have been changing with time. It would surely be a better situation if one could definitely say either that students were as good as they used to be, or better.
What constitutes the ideal examination? My guess as to a perfect examination is, first, that such an examination satisfies both student and examiner that what is to be measured was indeed measured. Second, a perfect examination should be reproducible within reasonable bounds year after year and student after student. Third, the test should be perfectly unbiased with respect to any and all characteristics of the student, such as sex, age, race, and so forth. Fourth, it should judge student responses in real time, as they are being offered, so that no self-deception is possible. Fifth, it should guide the examinee to correct errors on material that is subordinate to that which is actually under examination. Sixth, the perfect test should be patient with slow or error-prone students. Seventh, it should allow students to give up if they don’t know the material, with the implied promise that when they come back to try again, they will not be judged for having tried once before. If this reminds you of a doctoral oral exam, you’re right.
1.4 Why Not Examine on Computer?
Why not have a computer pose the problem to be solved, and grade the answer returned as the student watches? Then, if simple, “trivial” errors are committed, the computer can prompt the student with the location and type of error, and ask for a correction. The computer would have a DON’T KNOW button that would enable the student to face the absolute truth. I happen to be familiar with PLATO, and the rest of the discussion will be couched in PLATOeze.
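A minimal sketch of such a real-time grading loop, in Python rather than PLATO's TUTOR language. The function names, the relative tolerance, and the retry limit are all my inventions, not features of any actual system:

```python
def judge(value, expected, tol=1e-3):
    """Real-time judgment of a numeric response, to a relative tolerance."""
    return abs(value - expected) <= tol * max(1.0, abs(expected))

def examine(prompt, expected, max_tries=3):
    """Pose a numeric question and grade the answer as the student watches.
    Returns 'pass', 'fail', or 'dont_know'."""
    for _ in range(max_tries):
        answer = input(prompt + "  (or DK for DON'T KNOW): ").strip()
        if answer.upper() == "DK":
            return "dont_know"  # the student faces the truth and leaves to study
        try:
            value = float(answer)
        except ValueError:
            print("That is not a number -- try again.")  # a 'trivial error' prompt
            continue
        if judge(value, expected):
            return "pass"
        print("Not correct; check your algebra and your units.")
    return "fail"
```

Note that the DON'T KNOW branch returns without any penalty being recorded; the point is measurement, not punishment.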
Assuming that our students aren’t fools, we must presume that pushing the dreaded DON’T KNOW button (and leaving the examination for more study) would start a process of review based on a demonstrated lack of knowledge in a specific area. Or, it would finally force the student to re-assess his career choice. In either case, we have come as close as possible to an absolute determination of a state of knowledge, in an unambiguous and unbiased manner.
What I am arguing for is a method of facing the student with his or her own lack of knowledge in a manner which is unequivocal. We do not exult in that lack of knowledge, and we do not hold it against the student.
In the PLATO system, it is easy to arrange that the system rarely repeats exact questions, so that a student who is required to pass a test before proceeding cannot escalate many failures into a pass. We can ensure that an entire group takes substantially the same examination without any single member of the group taking the exact same examination. Furthermore, the technology assures us of two very important side effects of this method of testing. First, the method is uniformly applied. There is no question of anyone having any advantage over anyone else. Second, the method is applied consistently regardless of time. This means that the computer doesn’t tire and start grading more easily the further along it is in the process.
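The "same exam, different numbers" idea can be sketched by parameterizing a question template and seeding a random generator per student. The template, the ranges, and the seeding scheme here are my own illustration, not PLATO's actual mechanism:

```python
import random

def make_ideal_gas_question(rng):
    """Draw one instance of an ideal-gas question; each student's rng yields
    different numbers, so no two exams repeat exactly while everyone is
    tested on the same quantum of material."""
    n = rng.randint(1, 5)                 # moles
    T = rng.choice(range(250, 401, 10))   # kelvin
    V = rng.choice(range(10, 51, 5))      # litres
    R = 0.082057                          # L atm / (mol K)
    question = (f"Compute the pressure (atm) of {n} mol of an ideal gas "
                f"at {T} K in a volume of {V} L.")
    return question, n * R * T / V        # prompt and expected answer

# Seeding per student ID lets the examiner reproduce any sitting exactly.
q, a = make_ideal_gas_question(random.Random(12345))
```

Because the generator is deterministic for a given seed, uniform application is automatic: the machine deals every student the same kind of hand, and it never tires.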
It is easy to imagine an entire course predicated on this method of testing. Notice, I am not advocating using PLATO to teach the material, only to examine on the material. By using deadlines to pass landmark examinations, one could successfully teach a course without ever falling prey to the numerous misunderstandings which normal testing brings into play. PLATO examination would be the closest thing to a “written-oral” examination; that is, one in which the student expresses himself in words, but in which the examiner’s responses are instantaneous.
Using PLATO, it would be possible to demand perfection in the “trivial error correction mode” for core material before allowing the student to pass, while changing the grading strategy to a more strict variant for optional material required to raise one’s grade above the mere pass.
(I need to note that PLATO is, to the best of my knowledge, now dead (2008).) Computer Assisted Testing has been implemented in Perl (by me) and in other venues (see WebAssign.net for an excellent example). What we lack is the proctoring that would make security less of an issue. Right now, there is rampant cheating with on-line assessment tools.
If we would like to assure the outside world that our education has produced meaningful results, then an objective measure of knowledge attained is necessary. Such a measure would not assure the “consumer of our products” that the students had acquired the proper education. But it would assure that the student had acquired the education that the institution thought was proper. Perhaps more important, such testing would allow the student to value himself more highly. An absolutely impartial examination that has been passed is a credit to the examinee.
As a final suggestion, and this has little to do with computer grading in real time, I would propose that material from previous lessons or courses that was really never learned become a penalty to the student. The system should encourage retention of material. In return, we shouldn’t teach or require material that need not be retained.
This brings up a final point, one that is perhaps painful. If we examined students rigorously and demanded minimum competence, then it is quite possible that we would decimate our classes. Therefore, the changeover to such a stark method of examination should also be a time when we re-investigate our course content, to weed out those traditional subtopics which we may be fond of but which could not morally be used as insurmountable barriers to progress in the field. With rigid examinations, we would be constrained to examine only on meaningful subjects.
1.5 What’s Wrong With This Method?
For a confirmed addict of tomorrow’s technology, I have to admit that the method of examination described does not solve all our problems. Specifically, it ignores derivations, which, for those experienced in the field, are an indication of the amount of intellectual twisting that we have accomplished. One of our goals in physical chemistry is to teach the student to describe his or her ideas about a phenomenon in mathematical terms. Although most of our students cannot be made over into theorists, it is good to convince them that describing reality through equations is a do-able task. It is difficult to see how computer grading can lead to effective testing in this particular area of learning.
Also, PLATO is expensive, and with the advent of highly capable microcomputers and cheap secondary storage, it is clear that alternative technologies exist for carrying out the examination scheme outlined here. Unfortunately, this would mean “reinventing the wheel”, as the PLATO TUTOR language capabilities and system capabilities are outstanding tools for concentrating one’s attention on the job at hand, that is, composing lesson/test-ware.
Finally, there will always exist a subset of students who will be unable to deal with non-traditional testing methods, and it follows that traditional methods should be available to them. For some students, typing is an insurmountable chore, and they will not be efficient in front of a terminal. Others fear machines, and still others fear that they will break any machine they touch. So the machine age is not yet with us, even if these proposals are ultimately carried out somewhere.
On the plus side, there is another subset of students who claim that they “do not test well”, and for these students, opening up an alternative testing scheme might be a boon.
But my argument remains that quality education demands standards for students that are uniform, non-varying, and equitable. Computer grading of manipulative questions in chemistry, physics, mathematics, and so forth, would provide unambiguous proof for both the student and the teacher that knowledge was actually attained. Continuation of such a program between courses would allow for follow-up reinforcement of material which should be retained. The ultimate goal of such a program would be to stop graduating technical incompetents into the world. No social pressure of any kind could allow a student to pass through such a program if the student were unable to demonstrate learning of the required material. Consequently, when grades were investigated, one would be measuring by the categories “good”, “better”, and “best”, rather than by our present ambiguous standards.
1.6 Computer Guided Reading, A Logical CAT extension
With the failure of CAT, vide supra, it seems worthwhile now to discuss my other major initiative, Computer Guided Reading, which failed in the same manner, i.e., it was ignored. What follows is a discussion of what failed.
1.7 Introduction To Computer Guided Reading
Over the past few years, it has become apparent that certain skills which used to be part of the graduating college student’s armamentarium are becoming atrophied. In scientific/technical disciplines,
1. the widespread introduction of open book or open notes or open formula list examinations,
2. the introduction of multiple choice testing (including the ACS examinations),
3. the introduction of symbolic mathematical programming into texts (and courses and soon into examinations), and
4. the enthronement of calculators (and demise of slide rules)
have given rise to a culture in the classroom in which learning derivations in Physical Chemistry courses is silly, but learning the “plug and chug” nuts and bolts of doing standardized problems becomes the sine qua non of undergraduate achievement.
Reading assignments of texts are routinely ignored, and students treat their “homework” as doing assigned problems by searching for and cloning the nearest exemplar in their texts.
More and more, our students patiently and politely listen to lectures where academics derive complicated relations, knowing that this is a form of intellectual masturbation for the academics, one which has no relevance whatsoever to what real “scientists”, “doctors”, “lawyers”, “engineers”, whatever, really do.
Science, from their point of view, appears to consist of religiously accepting that the formulas they’ve been shown are correct, and that doing science consists of using these formulas. For the vast majority of students, the thought of ever doing an original derivation is inconceivable. Students seem willing to accept things, like the Second Law of Thermodynamics, and want to push ahead. There is little doubting, and little intrinsic faith that they, our next generation of scientists, will be called upon to create new equations for new, as yet unknown, phenomena.
To a certain extent, they believe that all that lies ahead of them is using computer programs to do something or other about which they are not too clear. But if something comes out of a computer, it is right, to them.
1.8 The challenge
If one asserts that reading science is different from reading Shakespeare, then it is necessary to clarify that difference in a manner which distinguishes between these two kinds of prose. One of the distinguishing characteristics, which one easily notes, is that there is no need to verify anything written by Shakespeare. Whatever is on the page is OK. It might be that what is written wasn’t written by Shakespeare, i.e., that someone is lying to the student in some way, but even that need not be terribly important.
On the other hand, scientific writing demands of the reader that s/he verify that every non-trivial assertion (of a scientific kind) be supported or proven, or be part of the hypothesis set which drives the prose (and mathematics) forward. Scientific writing is an attempt to convince the reader that what is written is true. Almost never is it intentionally misleading (although, of course, misleading is in the eye of the reader). Almost always, the author is attempting to convince the reader that what has been written is correct, i.e., a reflection of some kind of objective reality. Brevity easily results in unintended confusion.
But if one opens a journal, say the Journal of Chemical Physics, one is astounded to realize how many algebra and calculus manipulations have been omitted between virtually contiguous equations in most (if not all) of these manuscripts. In fact, the brevity of the manuscripts is part of the bravura nature of scientific writing, requiring that the reader fill in the details. Authors assume that the details are (possibly) trivial and not worth the valuable space on the printed page.
1.9 Scientific Writing and Teaching Scientific Reading
Assuming that all our technical graduate professionals will eventually read journals and have to learn from them, it seems important to make sure that they learn how to read in the critical style which is necessary for reading mathematics-based text. In such texts, equations, interspersed with words, are supposed to lead the reader; but the reader is still expected to be an active one, filling in missing details as necessary so that nothing, at the end, is taken on faith.
In thinking about this problem, of how to get students to read “with a fine-toothed comb”, validating each and every equation themselves, I have created a computer guided reading (CGR) scheme which prevents students from “turning the page” and continuing on when there is something on the page which they do not understand. The thought is that if they could develop the habit of checking what they read as they read, then it would carry over into future life, when textbooks and teacher/authorities are absent and they are required to learn from the written (journal) page on their own.
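A toy sketch of the gating idea in Python: each page carries a checkpoint that must be verified before the next page appears. The data layout and the crude string-matching check are my inventions; a real system would need far more forgiving answer judging.

```python
# Each page is a (text, checkpoint_question, accepted_answer) triple.
pages = [
    ("Page 1: moment of inertia of a diatomic about its center of mass ...",
     "For two equal masses m separated by r, what is I about the center?",
     "mr^2/2"),   # each mass sits at r/2, so I = 2 * m * (r/2)^2 = m r^2 / 2
]

def read_with_gate(pages, respond):
    """Show each page, then refuse to 'turn the page' until the reader's
    response to the checkpoint (obtained via respond) matches the answer."""
    for text, checkpoint, accepted in pages:
        print(text)
        while respond(checkpoint).replace(" ", "") != accepted:
            print("Check that result before continuing -- re-derive it.")

# A scripted reader: gets it wrong once, re-derives, then passes the gate.
scripted = iter(["mr^2", "m r^2/2"])
read_with_gate(pages, respond=lambda q: next(scripted))
```

The essential feature is the `while` loop: the page does not turn until the checkpoint is verified, which is exactly the habit CGR was meant to instill.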
1.11 Testing and Reading, a Challenge
In this work, and in my Computer Assisted Testing (CAT) work, the emphasis has been on making a precise measurement. When the computer is doing the “grading” or “judging”, it is imperative that that act be as perfect as possible, especially when the vast majority of students are going to accept the word of the computer as the absolute truth. Little do they suspect that the programming which led to the information on their screens is as flawed as the human who “inputted” the information, but that is another subject.
Accuracy of grading becomes imperative especially when one considers the ease with which one can (mistakenly, but possibly innocently) mislead a student, and thereby cause great (inadvertent) harm, without ever knowing what harm one has done! Consider the following multiple choice examination question: A ball is thrown up, comes to rest momentarily, and then falls back down. At its highest point, its velocity is:
1. equal to its displacement?
2. equal to its displacement divided by time?
3. at a minimum?
4. at a maximum?
In a recent discussion amongst physics teachers, this question was noted to be flawed in several respects. First, and most distressing to some, is the didactic statement that the ball comes to rest momentarily. This is utter nonsense, and frightening, since the better student will surely become discombobulated by this erroneous statement, and lose time, focus, and who knows what else, struggling with this piece of silliness, which was intended to lead the student to the “right” answer.
Setting aside the misstatement in the problem, some students will certainly be misled into skipping the question, guessing, or thinking about what the teacher (examiner) intended, rather than the morsel of physics that the question is intended to elicit. Since the velocity will be negative on the second half of the trajectory, at the first bounce, when the ball has hit the floor, its velocity will be maximally negative, and therefore “maximum” might be a right answer. But it isn’t what the machine expects, and any grading scheme, using computers, templates, or just human eyes, will be forced to mark the student’s response “right” or “wrong” incorrectly, since the question is grotesquely flawed.
The question has to be posed “perfectly” if its “objectivity” is to be exploited. Thus, assuming the offending “comes to rest momentarily” clause is removed, it still remains for the examiner to change velocity to speed and to define it (so there is no question of confusion), i.e., the speed, which is the absolute magnitude of the velocity.
In CAT and CGR, it is preferable to ask the student “what is the speed when the ball has achieved its greatest height?”, and allow a free numerical response (which should be zero). There are no misunderstandings in this case, and all is perfectly clear, and not disconcerting, to the student. And if s/he doesn’t know the answer, then the measurement is perfect! (I should note that zero answers are not the best, since they are prone to guesswork, which is what we’re interested in eliminating.)
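Following that parenthetical, a free-response variant whose expected answer is nonzero can be sketched like this. The phrasing, numbers, and tolerance are mine; the physics is just energy conservation for the ideal, drag-free case:

```python
def pose_ball_question(v0):
    """Free 'constructed response' item with a nonzero, non-guessable answer:
    neglecting air resistance, the return speed equals the launch speed."""
    prompt = (f"A ball is thrown straight up at {v0} m/s. Neglecting air "
              "resistance, what is its speed (m/s) when it returns to the "
              "launch height?")
    return prompt, float(v0)

def judge_free_response(answer, expected, tol=0.01):
    """Accept any response within 1% of the expected nonzero value."""
    return abs(answer - expected) <= tol * abs(expected)

prompt, expected = pose_ball_question(12.0)
```

A guessed zero now fails cleanly, so a correct response really does demonstrate the intended morsel of physics.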
1.12 Results and Discussion
This method has failed abominably. "The best laid plans of ...". I only report it here because "educational experiments" are rarely reported unless their results are glowing. A change might be appreciated.
Granted, only a tiny subset of students used the CGR system, for a graduate class in quantum chemistry. But one of the students, after passing the course, returned two years later with a set of corrections to a document that he had claimed to have read as a student. This time, while preparing for his oral examination (in the same field, hence the review of this topic, moments of inertia), he found several errors. I corrected them and asked him to check the revised manuscript again, and he found more errors (truly typographical) but missed a technical set of errors in the equations. That means that he still was not reading in the manner we are supposed to. I conclude that CGR in a classroom situation is not effective in making students read “with a fine-toothed comb”.
In fact, I concluded that nothing will work. It appears as if we have changed the ethos attached to learning and scholarship, converting attendance at institutions of higher education into pre-employment certification. Learning for its own sake doesn’t exist any more, if it ever did. Rather, our students are targeted at doing the minimum work required to “get by”, passing over, under, or through the barrier to their intended goal.
Recently, in the Times, there was a letter from a business school graduate bemoaning the fact that his class, in seeking to “beat the curve”, could never be expected to compete on a world level, since their only interest was in slightly better than average performance relative to their local cadre. How true.
We have lost our way. Learning, for excellence, doesn’t exist, only its trappings.