All Correction, All the Time: Is Written Error Correction Worth the Effort?

Studies reviewed:

Evans, N., K.J. Hartshorn, R. McCollum, & M. Wolfersberger. (2010). Contextualizing corrective feedback in second language writing pedagogy. Language Teaching Research, 14(4), 445-463.

Hartshorn, K.J., N. Evans, P. Merrill, R. Sudweeks, D. Strong-Krause, & N. Anderson. (2010). Effects of dynamic corrective feedback on ESL writing accuracy. TESOL Quarterly, 44(1), 84-109.

Evans, N., K.J. Hartshorn, & D. Strong-Krause. (2010). The efficacy of dynamic written corrective feedback for university-matriculated ESL learners' accuracy. System, 39, 229-239.

In the world of language teaching, some ideas die hard.  Error correction is one of those ideas that keeps rising from the dead after each seemingly fatal blow.  Krashen’s 1984 book, Writing, reviewed the evidence on error correction and found that it rarely did what it promised to do – improve students’ grammatical accuracy.  Subsequent papers by Krashen and a series of excellent critiques by John Truscott  (1996, 1999, 2004) put additional nails in the coffin. But error correction lives on, with plenty of defenders in the applied linguistics field.

I look here at three recent studies from researchers at Brigham Young University that take error correction to its logical conclusion – correcting nearly every mistake students make in every essay throughout an entire semester.  None of the studies, in my view, show that error correction was worth the massive effort that was invested in it.  I start with descriptions of the studies, then comment on them below.

Evans, Hartshorn, McCollum, & Wolfersberger (2010)

Participants: Two groups of ESL students described as “advanced low” studying in an intensive English program (IEP) for a 13-week semester.  (This was apparently the pilot study for the next two studies, and contained no control group.)  The groups (N = 12 and 15, respectively) were from different academic semesters, Winter 2007 and Summer 2007, and students ranged in age from 18 to 33.

Treatment: There were six steps in the rather lengthy error correction treatment, which the authors refer to as “dynamic Written Corrective Feedback” (WCF):

  1. Each group wrote 10-minute, timed essays at the beginning of most class sessions (the class met Monday through Thursday).  They were told to “follow the conventions of good paragraph writing, be as linguistically accurate as possible, and make the content substantive” (p. 455). Topics were assigned for each essay.  On average, the groups in this study wrote 31 paragraphs throughout the semester.
  2. The teacher collected the essays and then corrected them, returning them the next class session to the students.  Written feedback consisted of “marking the papers for lexical and syntactic accuracy” using 20 error-correction symbols.  Citing work by Ferris, the researchers state that the teacher indicated but did not correct “errors students can treat – those that can be corrected with systematic grammar rules” with one of the 20 symbols. But the teacher also did “direct error correction,” actually correcting the errors that students were thought unable to treat, described as “those that result from aspects of the language that must be acquired over time, such as prepositions or some lexical features” (p. 455). Then, the teacher assigned a “holistic” grade, with 75% for accuracy and 25% for content.
  3. The coded/corrected paper was returned to the student during the next session.  The student, in turn, had several additional tasks:
    (a) Keep a tally of errors by type of error;
    (b) Keep a list of all errors committed in context by category, which consisted of errors “typed exactly as they were originally and erroneously written” (p. 455);
    (c) Highlight or underline each error on their typed list of errors; and finally
    (d) Edit, type, and resubmit the paragraph to the teacher for a second review.
  4. The teacher then marked the typed paragraphs again for accuracy, but this time indicating the place where the error occurred with a check mark, a circle, or by underlining, although the teacher could, if needed, also supply the error code again. The papers were then returned to the students.
  5. Students corrected (if necessary) their second draft and returned it to the teacher.
  6. The teacher corrected and indicated the errors again, if necessary, and the process repeated until the paragraph was error free.  The researchers note that most paragraphs were error free within two drafts.
Measure:  Essays were grouped chronologically into four sets for analysis.  The holistic scores given by the teacher were added and averaged for each set. In addition, writing accuracy was measured by calculating the ratio of error-free clauses to total number of clauses written. The first and last sets’ scores were compared to measure increases in accuracy.

 

Results:  Table 1 shows the results for both the holistic and the error-free clauses measures, showing just the first and last sets of both groups (from Evans et al. Tables 1 and 2, pp. 457-458).

 

Table 1: Holistic Scores and Percentage of Error-Free Clauses in Evans et al. (2010)
Group          Holistic: 1st Set   Holistic: 4th Set   Error-Free Clauses: 1st Set   Error-Free Clauses: 4th Set
Winter 2007    7.39                7.86                45%                           55%
Summer 2007    7.45                7.69                42%                           54%
Students improved on their holistic scores from the first set to the fourth set for both the Winter 2007 group (t = -6.79, df = 11, p < .001) and the Summer 2007 group (t = -4.9, df = 9, p = .001).  Effect size was measured by partial eta squared, which was .81 and .73, respectively.  According to Cohen (1988), these are large effects.  Cohen’s d, another, more common measure of effect size, was not calculated, and standard deviations were not provided.  Wolf (1986), however, provides a formula for calculating d from the t statistic, which here yielded substantial d’s of 4.1 and 3.3 for the increase in the holistic (though largely accuracy-based) scores.
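Wolf’s (1986) conversion from a t statistic to Cohen’s d is d = 2t/√df. A quick sketch (my calculation, not from the articles) reproduces the figures cited above:

```python
from math import sqrt

def d_from_t(t, df):
    """Cohen's d approximated from a t statistic via Wolf's (1986)
    conversion: d = 2|t| / sqrt(df)."""
    return 2 * abs(t) / sqrt(df)

# Holistic scores (Table 1)
d_winter = d_from_t(-6.79, 11)      # ≈ 4.1 (Winter 2007)
d_summer = d_from_t(-4.9, 9)        # ≈ 3.3 (Summer 2007)

# Error-free clauses
d_winter_efc = d_from_t(-3.42, 11)  # ≈ 2.06 (Winter 2007)
d_summer_efc = d_from_t(-3.90, 9)   # = 2.6 (Summer 2007)
```

These match the d values reported in the text, which suggests this is the conversion that was used.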

 

Subjects also made apparently large gains on the ratio of error-free clauses to total clauses (Winter 2007 group: t = -3.42, df = 11, p = .006; Summer 2007 group: t = -3.90, df = 9, p = .004). Effect sizes reported by the researchers were partial eta squared (.52 and .63 for the Winter and Summer groups, respectively).  Cohen’s d calculated from the t statistic was also large: d = 2.06 for the Winter group and 2.6 for the Summer group.

 

Hartshorn et al. (2010): 

Participants: 47 low- and mid-advanced university Intensive English Program (IEP) students (mean age: 24 years) of various language backgrounds.

Treatment:  Students in the treatment group underwent substantially the same process of total error correction as in the previously discussed study of Evans et al.: 10-minute essays in each class, errors coded by the teacher, tallied and corrected by the student, and so on until the essay was free of errors.  In addition, the teacher discussed common errors in class.

The control group participated in a “process writing” approach during the 15-week period, during which their errors were also corrected. They did not do short, timed compositions like the experimental group, but wrote four multi-draft term papers and received feedback on each draft.  The total amount of writing, according to the researchers, was roughly the same.

Measure: The pretest and posttest consisted of 30-minute typed essays rather than the in-class essays.  Writing accuracy was measured by the ratio of error-free T-units to total T-units. Also measured were “rhetorical competence” with a rubric adapted from the TOEFL iBT; writing fluency, defined as the number of total words; and writing complexity, defined as the mean number of words per T-unit.
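The three quantitative measures are simple ratios, which can be sketched as follows (the tuple encoding of T-units is hypothetical, for illustration only, and is not the researchers’ actual coding scheme):

```python
# Hypothetical encoding of an essay as a list of T-units,
# each represented as (word_count, has_error).

def writing_measures(t_units):
    """Return (accuracy, fluency, complexity) as defined in the study:
    accuracy   = error-free T-units / total T-units
    fluency    = total words written
    complexity = mean words per T-unit
    """
    total = len(t_units)
    error_free = sum(1 for _, has_error in t_units if not has_error)
    total_words = sum(words for words, _ in t_units)
    return error_free / total, total_words, total_words / total

essay = [(12, False), (9, True), (15, False), (11, True)]
accuracy, fluency, complexity = writing_measures(essay)
# accuracy = 0.5, fluency = 47, complexity = 11.75
```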

Results: The treatment and control groups did not differ significantly on the rhetorical competence, fluency, or complexity measures, although the contrast group wrote slightly more than the treatment group (Cohen’s d is not reported; partial eta squared was .07). The treatment group outscored the contrast group on accuracy, however, by what appears to be a wide margin.  The researchers somewhat confusingly report the results as whole numbers, although they are in fact ratios of error-free T-units to total T-units. Results are summarized in Table 2 as percentages of error-free T-units on the pre- and posttest, calculated from the ratios given in the study’s Table 4 (p. 99).

Table 2: Percent of Error-Free T-units for Experimental and Contrast Groups in Hartshorn et al. 

Group          Pretest          Posttest         Difference
Experimental   14.02% (15.0)    24.16% (19.46)   10.14
Control        16.3% (10.7)     13.78% (11.81)   -2.52

(Standard deviations in parentheses.)

Effect size: partial eta squared = .21; Cohen’s d = .64 (calculated from Table 4, p. 99, using pooled standard deviation)

The researchers conclude that WCF was effective in significantly improving writing accuracy, noting that the effect size as measured by partial eta squared was large by Cohen’s (1988) guidelines. Cohen’s d was not calculated in the original article; computed from the reported means and standard deviations, it is .64, which Cohen considers a medium effect. In terms of the practical significance of the results, Hartshorn et al. note that the treatment group’s writing was “just over 75% more accurate than the writing of the contrast group.”
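The d values of .64 and .48 cited in this review can be reproduced from the posttest means and standard deviations using a simple (unweighted) pooled SD. This is my reconstruction of the calculation, but it matches the reported figures:

```python
from math import sqrt

def cohens_d(mean1, sd1, mean2, sd2):
    """Cohen's d with a simple pooled standard deviation
    (unweighted average of the two group variances)."""
    pooled_sd = sqrt((sd1 ** 2 + sd2 ** 2) / 2)
    return (mean1 - mean2) / pooled_sd

# Hartshorn et al. (2010), posttest means and SDs (Table 2): d ≈ .64
d_hartshorn = cohens_d(24.16, 19.46, 13.78, 11.81)

# Evans, Hartshorn, & Strong-Krause, posttest (Table 3): d ≈ .48
d_evans = cohens_d(57.80, 10.9, 50.30, 19.4)
```

With unequal group sizes a df-weighted pooled SD would be slightly more precise, but only the SDs are available, and the unweighted version reproduces the reported values.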

Evans, Hartshorn, & Strong-Krause (2010):

Participants: Both the treatment group (N = 14) and the contrast group (N = 16) were university ESL students (mean age: 21).

Treatment: Both treatment and control groups received essentially the same semester-long curriculum as the treatment and control groups of Hartshorn et al. (2010) and Evans et al. (2010).  The treatment group received extensive error correction, including having to code, tally, and log all of their errors, as well as rewrite their 10-minute in-class essays until they were error free, a process that the researchers note “many times…required multiple drafts” (p. 234).  The 10-minute essays were written “three to four times per week” over the 13-week semester, totaling about 19 pages of “polished” writing.  The contrast group received a process-writing approach; their errors were also corrected, but not as consistently or extensively as in the treatment group.

Measure: Fluency and complexity were measured much as in Hartshorn et al. (2010) (total number of words written and mean number of words per clause).  As in Evans, Hartshorn, McCollum, and Wolfersberger (2010), accuracy was measured as the ratio of error-free clauses to total clauses rather than by T-unit analysis, since the clause-based measure was thought to be more sensitive to change.

Results: The treatment group was slightly worse than the contrast group on both fluency and complexity, although the researchers report that the effect sizes were relatively small (partial eta squared = .04 and .06, respectively; Cohen’s d was not reported). On the measure of accuracy, the treatment group outscored the comparison students.  Results are reported in Table 3, from Evans et al.’s Table 3 (p. 235).

Table 3: Percent of Error-Free Clauses for Experimental and Contrast Groups in Evans et al.

Group          Pretest          Posttest         Difference
Experimental   47.10% (11.2)    57.80% (10.9)    10.7
Control        51.40% (12.6)    50.30% (19.4)    -1.1

(Standard deviations in parentheses.)

Effect size: partial eta squared = .16; Cohen’s d = .48 (calculated from Table 3, p. 235)

Comments:

1. As with many studies on error correction and “focus on form,” the conditions for Monitor use appear to be met during the elicitation measures of all three studies. The Monitor (Krashen, 1982) is our ability to use conscious knowledge (learning) of the language (such as grammar rules) to make our speaking and writing more accurate than it might otherwise be. To use the Monitor, you must (1) know the rule you need to apply in the given situation, (2) be focused on the form or accuracy of the sentence you are producing, and (3) have time to use the rule.  These conditions are difficult to meet in the real world, but, when they are met, the Monitor can improve our accuracy. In these three studies, as we would expect, the students who knew the rules, were focused on form and accuracy, and had sufficient time to use their conscious knowledge (learning) were more accurate than (in the two studies with control groups) students less focused on form and presumably less knowledgeable of the rules.

The results of these studies, then, are perfectly consistent with current second language acquisition theory (Krashen, 2003) on the use of conscious knowledge in language production.  There is little doubt that after 13-15 weeks of massive error correction and perhaps as many as 30 hours spent focusing on form, not only in class but in their homework, students already oriented to grammar instruction and error correction (university ESL/IEP students) were able to use their conscious knowledge during the assessments.  While the essays were timed in all three studies, the 10-minute essays in the first study were very short pieces of writing in which accuracy was heavily stressed (the sample student essay included in the article is only eight sentences and fewer than 100 words long, although no fluency measures were reported, so it is not clear how typical this sample was). For the other two studies, 30 minutes would be sufficient time for most students to focus on the grammatical correctness of their writing.

In addition to meeting the conditions of the Monitor, the treatment groups were also very practiced in timed essay writing, having done so 30 or more times during the semester.  There is no indication that the contrast groups did any timed writings; they were thus likely to be less practiced at writing under such conditions.

2. As the researchers point out, this was not a comparison of error correction and no error correction, but rather (in studies 2 and 3 above) a comparison of extreme corrective feedback with more traditional error correction. Other studies, such as Syying Lee’s work on extensive reading and writing (Lee & Hsu, 2009), have found that accuracy improves significantly simply through more reading, without the extensive (and time-consuming) WCF used here.

3. The studies lacked any delayed post-test.  This is a crucial element in any study on the effects of form-focused instruction, as previous studies that have included a delayed post-test have nearly always found a sharp decline in the gains demonstrated on the immediate post-test. The effect of explicit instruction typically fades, although Krashen (2003) notes that it can take up to several months for the declines to show up (pp. 42-43). This weakness of all three studies is alone reason to question the researchers’ optimistic assessment of the benefits of corrective feedback.

The gains that were found were indeed large by Cohen’s guidelines for partial eta squared, but, for the last two studies, somewhat less so when calculated from the mean post-test scores (d = .64, a medium effect, for Hartshorn et al.; d = .48, a small to medium effect, for the third study).

4. Unlike the researchers, I do not find the practical effect of this huge investment of time in error correction very impressive.  In the first study, students went from having 55-58% of their clauses with errors to 45-46% with errors. In Hartshorn et al., treatment group students went from having 86% of their T-units with errors to “only” 76% error-filled T-units.  In the third study, students in the treatment group went from having 53% of their clauses with errors to 42% with errors, all this after painstaking and massive error correction efforts.

While this is an improvement, it is dubious whether teachers would think the considerable effort involved in carrying out this “all-correction, all-the-time” agenda in their own classrooms worth such a small result. Remember that the teachers coded nearly every error in the essays from nearly every class period, in addition to having to re-code/correct follow-up drafts. Students had to re-type the errors, tally them, classify them, underline each error, type and correct the essay, then repeat the process if the essay wasn’t error free. No estimates are given of how much time the teachers and students collectively spent on correcting errors, but it appears to be substantial. These are, of course, as pointed out above, students from a population already heavily geared toward grammar study, making them ideal candidates for this sort of treatment. To leave the semester still making so many errors can hardly be claimed as a victory for error correction.

Works Cited

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed). Hillsdale, NJ: Lawrence Erlbaum.

Krashen, S.D. (1982). Principles and practice in second language acquisition. Pergamon Press.

Krashen, S.D. (1984). Writing: Research, theory, and applications. Torrance, CA: Laredo Publishing.

Krashen, S.D. (2003). Explorations in language acquisition and use: The Taipei Lectures. Portsmouth, NH: Heinemann.

Lee, S. Y., & Hsu, Y. Y. (2009). Determining the crucial characteristics of extensive reading programs: The impact of extensive reading on EFL writing. The International Journal of Foreign Language Teaching (IJFLT), 12-20.

Truscott, J. (1996). The case against grammar correction in L2 writing classes. Language Learning, 46, 327-369.

Truscott, J. (1999). The case for “The case against grammar correction in L2 writing classes”: A response to Ferris. Journal of Second Language Writing, 8, 111-122.

Truscott, J. (2004). Evidence and conjecture on the effects of error correction: A response to Chandler. Journal of Second Language Writing, 13, 337-343.

Wolf, F. (1986). Meta-analysis: Quantitative methods for research synthesis. Newbury Park, CA: Sage Publications.