Research article
Open Access

CanMEDS Competencies in Family Medicine Residents: Can Criterion-Based Assessment Improve the Quality of Teacher Feedback?

Caroline Simard[1], Luc Côté[1], Laurie de Bruyn[1], Miriam Lacasse[1]

Institution: 1. Université Laval
Corresponding Author: Dr Miriam Lacasse ([email protected])
Categories: Assessment, Learning Outcomes/Competency, Postgraduate (including Speciality Training)
Published Date: 19/01/2021


Background: The Université Laval family medicine program has developed an innovative computerized tool called the criterion-based Competency Assessment Tool (CAT), currently undergoing validity assessment.

Methods: This study followed a qualitative design assessing written comments collected in the assessment reports from the cohorts before and after the implementation of the CAT (npre = 200, npost = 200) in order to ascertain the tool's consequence validity. A deductive thematic content analysis was performed and pre- and post-implementation cohorts were compared.

Findings: Overall feedback quality does not appear to have changed between cohorts. When analyzing CanMEDS roles separately, each is covered more often, but related comments appear to be less specific. The new report also seems to enable the teacher to tell more with the same number of words.

Conclusions: Perhaps since the items are complete, exhaustive, and detailed enough to be self-explanatory, the tool helps the teacher to cover a wider area of competencies without the need to add many details with narrative comments. Consequence validity does not seem to have been substantially affected by changes in the family medicine resident's competency assessment, but the results do not support the contention that comment quality has improved either.

Keywords: Computed-based feedback; feedback; criterion-based assessment; family medicine residency.


Feedback is the cornerstone of medical education or at the very least, it is central to improving a resident's performance. When feedback is delivered in a ‘state of the art’ manner, its virtues can be very beneficial indeed. It especially helps the recipient regulate his or her learning through improved reflection (Ericsson et al., 2006; van den Boom et al., 2004) and greater engagement (Burgess and Mellis, 2015; Price, Handley and Millar, 2011). By reducing the "gap between what is understood and what is aimed to be understood" (Hattie and Timperley, 2007, p. 82), feedback also helps the resident to detail his or her own strengths and limits, as well as what to learn and how it can be achieved (Hattie et al., 2007; Norcini and Burch, 2007). Contemporary educational models place emphasis on the interactive aspect of the relationship between the feedback provider and its recipient and on the collaborative construction of knowledge (Ossenberg, Henderson and Mitchell, 2019). Telio, Ajjawi and Regehr (2015) proposed the learner-centered approach of educational alliance, which corresponds to the learner's perception of his or her trainee's authenticity and commitment to a learning process, involving "seeking shared understanding of performance and standards, negotiating agreement on action plans, working together toward reaching the goals, and co-creating opportunities to use feedback in practice" (p. 612). Bing-You et al. (2018) used the metaphor of the “tango dance” to illustrate the complexity of the partnership between the clinical teacher and the student. In return, this educational dialogue encourages attitudes of feedback recognition, usage, and seeking in learners (Bowen, Marshall and Murdoch-Eaton, 2017). In this context, while quality feedback contributes to educational alliance, the latter help foster adherence to what has been conveyed.

However, feedback quality is far from being uniformly satisfying (Bing-You et al., 2018). When feedback fails to be constructive, and especially if clinical teachers’ comments are non-specific and non-descriptive, it can hinder learning or cause negative outcomes, such as uncertainty, questioning teachers’ credibility, a reduction of receptiveness or adherence, and mostly a decision to avoid seeking feedback (Archer, 2010; Baron, 1988; Telio, Ajjawi and Regehr, 2015; van de Ridder et al., 2015).

The consensus is well established in medical education concerning what constitutes effective feedback, leading to several recommendations (Archer, 2010; Boehler et al., 2006; Brehaut et al., 2016; Brinko, 1993; Canavan et al., 2010; Chur-Hansen and McLean, 2006; Ende, 1983; Ghazal, Gul and Khowaja, 2015; Jongekrijg and Russell, 1999; Lefroy et al., 2015; Molloy et al., 2018; Ossenberg, Henderson and Mitchell, 2019; Parkes, Abercrombie and McCarty, 2013; Saraf et al., 2014; Watling and Ginsburg, 2019).

  • Feedback should first be relevant to the recipient. It should be criteria-based, linked to personal goals meaningful to the learner, related to well-defined goals and performance indicators, and explicit regarding whether the learner met the standards. In addition, the information should be factual and based on remediable behaviours. Advice should be future-oriented instead of dwelling on the past.
  • Secondly, constructive feedback should be specific. Specific feedback is focused on tasks, concrete information, and behaviors. It should not be based on inferences, generalizations, judgements, or on the learner’s character.
  • Thirdly, feedback should include pathways for improvement. Providing suggestions, limits, and omissions offers guidance and facilitates one’s performance review. Balancing both positive and negative comments is beneficial to feedback constructiveness, though the strict usage of the popular "feedback sandwich" technique (positive-negative-positive feedback; LeBaron and Jernick, 2000) does not appear always to be successful in increasing performance. Molloy et al. (2018) suggest instead to prioritize feedback literacy and help the learner develop his or her capacity to "anticipate emotion and manage emotions in relational activities, particularly when there are disparate perspectives on performance between the learner and the "other", whether that is the teacher, peer, or patient" (p. 2). While attention should be focused on preserving self-esteem, feedback providers should also identify strengths and reinforce achievements, without, however, offering praise. Awareness should be raised concerning the tone of feedback, depending on whether it is presented in terms of reflection, praise, criticism, or suggestion. In the end, instead of focusing on a one-way "sandwiched" feedback or on the feedback delivery, the clinical teacher should focus on the interactive process and orient a dialogue toward both strengths and pathways for improvement in order to stimulate reflection and self-assessment.

However, the mere "mechanical" application of the aforementioned recommendations is also detrimental to educational alliance and adherence to feedback pathways. They should be conceived in the broader context of a dialogue and interactive process. In light of these recommendations, previous works investigated whether feedback was effective in improving physicians’ performance. Veloski et al. (2006) conducted a systematic review of 41 studies measuring physicians’ performance before and after they received feedback. Only 74% of studies ended with positive results in terms of performance. In a following meta-analysis, Mansouri and Lockyer (2007) retrieved 31 studies concerning interventions in continuing education aimed at improving physicians' performance. Only one article included a written feedback intervention and the effect on knowledge and performance was not significant. Bing-You et al. (2017) reviewed articles about feedback for learners in medical education and only a few articles compared performance before and after feedback. Moreover, studies on feedback applied to medical education are even rarer, and most rely on less rigorous methodologies (Bing-You et al., 2017). In the context of these partial or mitigated results, ongoing efforts must be made to better understand the virtues of feedback and especially the characteristics of constructive feedback, as well as how it can be influenced by assessment tools.

The current study

Université Laval is one of three francophone faculties of medicine in the province of Quebec, Canada. Its family medicine residency program developed an innovative electronic tool, the criterion-based Competency Assessment Tool (CAT ; Lacasse et al., 2017), in response to the new physician competency framework established by the Royal College of Physicians and Surgeons of Canada and the College of Family Physicians of Canada, namely the CanMEDS (Frank, Snell and Sherbino, 2015).

This tool is based on rubrics that represent program expectations with regard to achievement of various CanMEDS-MF competencies in the course of residency and specifies the time intervals for each supervision level (or independence level). The assessed competencies and expected performance for each of these differ depending on the rotation and the training period. The CAT can suggest educational diagnoses and prescriptions based on the assessment result so as to support resident feedback and guide clinical faculties in their decisions on the rotation outcome. The CAT is further detailed in the Measures section.

The written feedback it comprises aims at helping students glean an accurate portrait of their progress throughout the program. In this context, it is essential to verify the quality of the feedback given and whether it has improved with the implementation of the new assessment tool.


This cohort study compares the written comments on the assessment reports concerning residents' competencies before and after implementation of the CAT. We used a mixed-methods convergent study design (Sunderji and Waddell, 2018) to ascertain whether the CAT is associated with a change in the quality of written comments included in the resident assessment reports. The research was approved by the Université Laval ethics committee and the faculty direction.

Data collection

Assessment reports were provided by the Université Laval family medicine residency program for the 2013-2015 pre-implementation cohort (n =1037) and the 2016-2018 post-implementation cohort (N = 1380). For comparison purposes, 200 reports per cohort were selected among those that included comments (Npre = 1025, 98.84%; Npost = 1115, 80.80%). All reports identifying residents as “in difficulty" in the capstone were included (npre = 8, Npost = 25), and then the rest were chosen randomly, respecting a quota of 100 reports for residents completing the first year of residency (R1) and 100 reports for R2 (second year of residency) that is, for each cohort.


The CAT is an online assessment tool comprising 34 competencies (indicators) grouped in seven roles: medical expert, communicator, collaborator, leader, health advocate, scholar, and professional. This new criterion-based competency assessment tool (CAT) takes into account both the program expectations in terms of competency development across time and the resident’s progress achievement. This summative instrument is generally used monthly or bimonthly by clinical teachers. As the CAT is an electronic-formatted tool, it helps clinical teachers in their assessment by adding examples of behaviours describing observable manifestations of the CanMEDS associated with each competency in the form of a notice that appears during a mouse hover. These examples aim to help clinical teachers in assessing the resident more objectively. The CAT also makes it possible to write comments about each CanMEDS role. Finally, it guides clinical teachers in determining the issue of the rotation, namely whether the resident passes, fails, is considered in difficulty, or is put on probation.

Qualitative coding

Clinical teachers’ feedback in each assessment report was qualitatively and deductively coded independently by researchers (2013-2015 cohort: CS and LDB; 2016-2018 cohort: CS, LC, and LDB) based on the following three criteria described in the aforementioned literature:

Relevance. Firstly, constructive feedback must be relevant, suggesting that the evaluated (and commented) elements are part of the assessment report or are directly related to it. In the current study, relevance was operationalized as follows: The comment refers explicitly to the definition elements of the CanMEDS roles and to the assessment report's content.

Specificity. Secondly, constructive feedback must be specific. It must also be based on facts or actions and anchored in descriptive data, instead of interpretations and presumed intentions. In this study, specificity was operationalized as follows: The comment refers to specific facts or behaviors related to the resident's performance aligned with CanMEDS roles rather than general and interpretative aspects.

Presence of pathways for improvement. Thirdly, feedback is constructive when areas of improvement are provided. To maximize feedback efficacy, the proposed pathways for improvement should ideally be accompanied by suggestions about how to achieve the desired improvement. In this study, the presence of pathways for improvement was operationalized as follows: The report comments include at least one area of improvement, which can be mentioned as a specific educational prescription or as a limitation to be corrected.

Relevance (abridged as "R") and Specificity (abridged as "S") were coded for each comment, while presence of Pathways for improvement (abridged as "PI") was coded throughout each report. The PI score was generalized in this sense because pathways for improvement can be suggested intrinsically without necessarily being present in every comment and without undermining the quality of feedback received through the assessment report. We transformed qualitative data into quantitative data by attributing a score of "1" when the criterion was present in a given comment, and a score of "0" in its absence, resulting in a dichotomous variable.

As there are discrepancies between the number of competencies (and comments) as regards pre- and post-implementation reports, the two principal researchers (CS and ML) linked the pre-implementation comments for each CanMEDS role they referenced (discrepancies revised by ML). Afterwards, three coders (CS, LC, and LDB) analyzed the presence/absence of R, S, and PI in the 30 first reports before meeting to discuss ambiguities and potential adjustments to coding. The process was repeated with 30 other reports. After the two calibration meetings, they independently revised the first 60 reports, and then analyzed the other 140. Subsequently, the 200 pre-implementation reports were independently analyzed by CS and LDB. For further analysis, Valence was also coded (i.e. if the comment is 0 = negative; 0.5 = neutral; or 1 = positive).

Score computation

The initial analysis led to a Relevance and Specificity score for each CanMEDS role and each rater. Divergences between coders were identified. In the pre-implementation cohort, the first author revised both raters' scores to determine the final score when a consensus was not reached. In the post-implementation cohort, because there were three raters, the final score was determined based on the majority of responses. In other words, the option (presence or absence) was established according to the highest number of votes (i.e. two versus one). To get only one Relevance and one Specificity score per report, average values were computed (the average of each role score). Afterwards, a feedback quality score (FQS) was computed through averaging the mean R, mean S, and PI. Hence, scores varied from 0 to 1 and were reported in percentages, thus facilitating interpretation.

Quantitative data analysis

Inter-rater reliability was verified with intra-class correlation (two-way mixed, absolute agreement). Internal consistency was verified in order to obtain an interpretable feedback quality score derived from R, S, and PI. Chi-squared testing was performed to see whether there were differences in CanMEDS coverage between cohorts. Analysis of variance (ANOVA) was performed regarding pre-FQS (2013-2015) and post-FQS (2016-2018) to ascertain whether differences in feedback quality have emerged since implementation of the CAT, this being the main purpose of the study. To document the context of feedback quality, the three feedback quality criteria were correlated with comments' valence and residents' performance. Performance is first represented by CAS (competency acquisition score, which is in simple terms a percentage of mastery in all the CanMEDS competencies), but this score is only available for the 2016-2018 cohort. The CAS has the advantage of making results uniform across stage types and of weighting competencies according to their relative importance for each stage. However, as this score is based on a very strict specific competency set for each stage, it does not consider partial success and it treats missing data as a failure. Its use being contextually appropriate, the CAS might engender bias in the current study. In this regard, we also correlated feedback quality criteria with percentage of success in competencies (PSC score), which could also be available for the 2013-2015 cohort. Both CAS and PSC vary from 0 to 1 and are based on the final competencies score; hence they should be somehow comparable. However, the PSC excludes missing data and average available data. Because scores vary from 1 to 3 in the 2013-2015 cohort and from 1 to 4 in the 2016-2018 cohort, they were recoded between 0 and 1 before being averaged. As concerns verifying whether feedback quality was different if the students experienced isolated competency achievement delays (if concerns were raised about their progress), or if they were classified as being in difficulty or as failures, inferential statistics could not be computed due to small sample sizes. Still, descriptive statistics were computed. Finally, as residents' programs were grouped into six categories, we compared the three quality criteria among programs through ANOVA. When the equal variance assumption was not met, Welsh's adjusted F was computed (Liu, 2015; Tomarken and Serlin, 1986). Post-hoc analyses were performed using the Bonferroni correction, or in the case of unequal variance, using the Games-Howell test.


Among the 400 reports analyzed, 131 involved male students' (32.8%), and 263 involved female students, with 6 being gender unspecified (65.8%; with 26 reports missing data concerning a pair of residents in the post-implementation cohort = 1.5%). A total of 88 residents were represented in the 2013-2015 cohort reports (of 101 residents in the cohort), and 95 residents were represented in the 2016-2018 cohort reports (of 106 residents in the cohort).

In order to illustrate the criteria and their application, we provide in Table 1 a few examples from the CanMEDS role expressed as "medical expert" (freely translated from French to English by the authors), where some criteria are present or absent.

Table 1: Comments extraction to illustrate feedback quality criteria

Comments (with original French version)  R   S   PI 

Professional role


"Is constantly ready to learn. Good attitude, favourable to exchanges."

"Se met en position d'apprentissage. Belle attitude favorable aux échanges."

1 1 1

"Adequate knowledge for the training level."

"Connaissances adéquates pour le niveau.

0 0 0

Knows her limits, but difficulties are too important for this rotation level."

"Connait ses limites, mais limites trop importantes pour niveau de stage."

1 0 0

"Wants to learn, professional approach, is committed to the well-being of patients."

"Veut apprendre  approche professionnelle, a à cœur le bien-être de ses patients."

1 0 0

"Great clerk orientation and supervision."

"A très bien orienté et supervisé un externe."

0 1 0

"Takes initiatives in managing the patient"

"Prends l'initiative de la prise en charge"

0 1 0

Scholar role


"Emergency decisional algorithms are not always mastered (e.g., child with a fever), which places [the resident’s] knowledge at a below average level."

Original French: "Algorithmes décisionnels en lien avec l'urgence ne sont pas toujours acquis (fièvre chez l'enfant...) ce qui le positionne en-dessous de la moyenne au niveau des connaissances."

1 1 1

Note. Original French version is in italic. R: Reliable; S: Specific; PI: Presence of pathways for improvement. 0 = absence of criterion; 1 = presence of criterion.


Interrater reliability was relatively solid regarding overall feedback quality in both the pre- (2013-2015) and the post-implementation (2016-2018) cohorts with intraclass correlations of .81 (p = .00) and .83 (p = .00), respectively. It was also very good for separate criteria with .73/.74 for Relevant (pre/post); .77/.79 for Specific; and .90/.92 for Pathways for Improvement. Internal consistency was also tested to determine whether the three feedback quality criteria are homogeneous and whether merging them into a single FQS score would be reliable. Internal consistency (Cronbach's alpha) among the three quality scores (Reliability, Specificity and Pathways for Improvement) is.41 for the 400 reports; .36 for pre-implementation reports; and .46 for the post-implementation cohort, while the expectation is an alpha exceeding .70  (Nunnally & Bernstein, 1994). Those low coefficients suggest that the three quality scores should be interpreted separately instead of with an aggregate FQS in the current study. In addition, PI is correlated to R with r = .18 (p = .01), and correlated to S with r = .30 (p = .00), while R is correlated to S with r = .41 (p = .00; see Table 4 in the section "correlations with other variables").

Cohort comparison

On average, pre-implementation report comments were provided 61.1% of the time (concerning five possible comment zones) and 58.6% of the time for post-implementation reports (concerning eight possible comment zones). However, this difference is not significant (t [393.47] = .95, p = .34). When comparing in terms of role coverage, the number of reports having covered a given role increased for all roles except Leader (see Figure 1).

Figure 1: Number of reports containing comments about each CanMEDS role.

Note. *: p < .05; PROF = Professional; COMM = Communicator; SCHO = Scholar; MDEX = Medical expert; HADV = Health advocate; COLL = Collaborator; LEAD = Leader. Professionalism: χ2(1) = 7.53, = .01; Communicator χ2(1) = 15.42, = .00; Scholar χ2(1) = 36.50, = .00; Medical expert χ2(1) = 9.14, = .00; Health advocate χ2(1) = 18.56, = .00; Collaborator χ2(1) = 14.18, = .00; Leader χ2(1) = 0.16, = .69.

Score comparison for ANOVA (see Table 2) revealed no difference whatsoever in feedback quality between cohorts. However, the difference between cohorts for relevance is marginally significant (F = 3.52, p = 0.06). Feedback quality scores per criterion are illustrated on an aggregate basis, per rotation level, and per CanMEDS role in Figure 2. When compared to the CanMEDS role, ANOVA revealed that relevance diminished for Scholar in the post-implementation cohort (F = 4.63, p = 0.03; η2 = .03), as well as for Specificity for Scholar (F = 13.63, p = 0.00; η2 = .08); Health Advocate (F = 4.48, p = 0.04; η2 = .09); Collaborator (F = 17.11, p = 0.00; η2 = .10); and Leader (F = 5.96, p = 0.02; η2 = .03). According to Cohen's rule of thumb (1988), eta-squared effect sizes of .02 are small and of .13 are medium. The other pre-post comparisons presented in Figure 2 are not statistically significant.

Table 2: Mean quality scores per cohort, cohorts differences in ANOVA, and CanMEDS coverage reports

  Cohort samples Difference testing




(n = 200)

F p n2
Mean Feedback Quality Score (SD)

All (R1 & R2)

65,1% (0.25) 67.6% (0.24) 0.99 .32 n.s.


83.9% (0.25) 88.0% (0.18) 3.52 .06 n.s.


69.2% (0.33) 69.0% (0.28) 0.01 .94 n.s.

  Pathways for improvement

42.0% (0.99) 45.5% (0.50) 0.99 .32 n.s.


65.1% (0.24) 70.4% (0.25) 1.98 .16 n.s.


81.4% (0.27) 87.2% (0.20) 2.63 .11 n.s.


67.4% (0.32) 66.6% (0.31) 0.07 .79 n.s.

  Pathways for improvement

46.0% (0.50) 57.6% (0.50) 2.43 .12 n.s.


65.1% (0.25) 65.1% (0.23) 0.00 .99 n.s.


86.3% (0.23) 89.1% (0.16) 1.00 .32 n.s.


71.1% (0.34) 71.8% (0.25) 0.03 .85 n.s.

  Pathways for improvement

42.0% (0.49) 34.0% (0.48) 0.34 .56 n.s.


88.0% (0.22) 85.4% (0.26) 1.20 .27 n.s.


87.0% (0.22) 78.6% (0.30) 4.67 .03 .15


89.0% (0.21) 92.1% (0.18) 1.22 .27 n.s.
CanMEDS commentary

Number of reports with commentary

  Leader 143 166      
  Collaborator 115 152      
  Health Advocate 49 108      
  Medical expert 165 185      
  Scholar 10 38      
  Communicator 63 100      
  Professional 102 106      

Note. Statistical difference is tested through ANOVA. p = probability (significance is noted with values <.05). When significant, η2 is provided, indicating effect size, which is interpreted as following: ~ .01: small effect;  ~.09: medium effect; ~0.25: large effect.



Figure 2a: Feedback quality for pre (2013-2015) and post (2016-2018) CAT implementation


Figure 2b: Feedback quality for pre (2013-2015) and post (2016-2018) CAT implementation, per residency level

Note. PI = Pathways for improvement.


Figure 2c: Comparison of comments quality across cohorts, per CanMEDs role

Note. *: p < .05; PROF = Professional; COMM = Communicator; SCHO = Scholar; MDEX = Medical expert; HADV = Health advocate; COLL = Collaborator; LEAD = Leader; PI = Pathways for improvement.


Convergence with other variables

Correlations between total scores are presented in Table 3. Valence is only correlated with PI in the 2013-2015 cohort, and it is correlated with both PI and S but not with R in the 2016-2018 cohort. As for performance, the CAS is only correlated with PI.

Table 3: Correlations between feedback quality scores, feedback valence, and the competency acquisition score

  R S PI Valence % commented PSC CAS
 All cohorts
  Relevance - .39** .04 -.04 -.06 -.10*  

























% commented
















 Relevance - .39** 0.06 .04 -.07 -.12  

























 % commented
















 Relevance - .41** .16 -.13 -.04 -.23** -.02

























 % commented








   PSC           - .38**

Note. R: Relevance, S: Specificity, PI: Presence of pathways for improvement, PSC: percentage of success in competencies, CAS: Competency acquisition score; *p < .05, **p < .01.


Feedback quality and progress difficulties

Even though inferential statistics could not be computed between student groups, Table 4 shows mean quality scores for pre-CAT post-CAT implementation. There seems to be a tendency in both cohorts indicating that comments are more specific when students experience difficulties, and even more when there are progress concerns without failure. Before the CAT, the valence seemed to be drastically different if students experienced difficulties, a result that might be less radical with the CAT.

Table 4: Mean quality scores (in percentage) and standard deviation per cohort and residents' difficulty level

Cohort   Residents' performance
 Dependant variable   Expected Progress concerns Failure or in difficulty
2013-2015                 N =   185 7 8



83.67 (26.04)

92.43 (13.46)

81.38 (15.76)



68.16 (34.21)

82.00 (13.60)

83.38 (19.38)

 Pathways for improvement


37.30 (48.49)

1.00   (0.00)

1.00   (0.00)



92.22 (14.84)

39.00 (22.52)

34.38 (28.03)



61.01   (8.38)

48.00   (2.00)

43.50   (5.95)

2016-2018                 N =   159 16 25



86.57 (19.18)

96.25 (10.88)

92.04 (12.93)



64.76 (29.45)

90.88 (11.47)

81.80 (18.41)

 Pathways for improvement


44.94 (49.90)

50.00 (51.64)

48.00 (50.99)



95.34 (13.00)

53.50 (32.92)

42.88 (21.96)



69.14   (4.45)

63.44   (1.97)

56.80   (5.84)

Note. Progress concerns: students experiencing isolated competency achievement delays; Failure or in difficulty: based on the capstone.

Rotation comparison

Rotation comparison concerning the three quality criteria is illustrated in Figure 3. In the ANOVA for the 2013-2015 cohort, there were no statistical differences in R, S, nor PI scores (Welsh's FR = 1.91, p = .11; FS = 1.22, p = .30; Welsh's FPI = 2.38, p = .05), though the PI difference is marginally significant. Post-hoc contrasts highlight a large difference between family medicine and perinatality concerning PI (p = .04; Cohen's d = .67). In the 2016-2018 cohort, the PI scores actually vary across programs (Welsh's FR = 1.72, p = .16; Welsh's FS = 1.06, p = .40; Welsh's FPI [excluding psychiatry, being without variance] = 3.16, p = .03). This time, psychiatry (where the scores for all eight reports were zero) differed significantly from family medicine rotations (p = .00; Cohen's d = 1.53) and emergency (p = .00; Cohen's d = 1.44).

Figure 3. Mean feedback quality scores per program, for each cohort.


The goal of the present study was to further validate the new CAT by verifying whether changes in the form and content of reports were associated with changes in feedback quality concerning the written comments accompanying the reports. We established three criteria to determine written feedback quality in the context of CAT administration: relevance, specificity, and the presence of pathways for improvement. Low homogeneity in scores among the criteria suggests that they should be interpreted separately instead of as an aggregate score. As for our analysis frame, agreement between the raters is good, while it is excellent for the category Pathways for Improvement, suggesting that the criteria were clearly and unanimously understood by the raters. The higher agreement for PI might be because the definition was simpler and required less interpretation for the coders.

Results indicate that the CAT does not appear to be associated with an increase in feedback quality. Most comments were rated as relevant in both cohorts: in 83.9% of the 2013-2015 reports and in 88.0% of the 2016-2018 reports (though relevance is marginally higher in the 2016-2018 cohort), a finding which is encouraging. Specificity was a little less common, emerging respectively in 69.2% and 69.0% of the reports. The addition of more details and examples in the new report may have taken the words out of their mouths, as it were, leaving the clinical teachers with not much to add. Furthermore, the specificity level differences are comparable across cohorts, suggesting that implementation of the new CAT might not have impacted specificity. In general, pathways for improvement are suggested in only 42.0% (2013-2015) and 45.7% (2016-2018) of reports, which also do not differ across cohorts. While the latter percentages may appear low, the residents are performing at a very high level: an "expected" level of performance corresponds to a PSC (percentage of success for the competencies) of 50% in 2013-2015 and of 66.67% in 2016-2018 and the means are 59.86% and 67.14%, respectively. In this context, it may be difficult for clinical teachers to find any improvement to suggest to such outstanding residents.

Clinical teachers completed the comments section with a similar amount of feedback before and after implementation of the CAT. As Lefroy et al. (2015) states, grades should not be provided without commentary; hence, we should expect to obtain comments covering the seven CanMEDS roles. In this matter, the new report provides better overall coverage of the various CanMEDS roles (excluding Leader), but Health Advocate is still the least covered role in both cohorts. The fact that CanMEDS are better covered with the CAT might be attributable to the explicit effort that was made during its development to clarify each role in detail with concrete rubrics. This might have facilitated the feedback provider's role by reminding each provider of the aspects under assessment. Feedback quality is similar among cohorts, but when roles are taken separately, Scholar comments are less relevant and specific (small and medium-large effect, respectively); while specificity is reduced for Health Advocate (medium-large effect); Collaborator (medium-large effect); and Leader (small effect). When comparing the different rotations, clinical teachers in perinatality appeared to propose fewer pathways for improvement than in family medicine as regards the 2013-2015 cohort. Differences in PI scores also vary in general for the 2016-2018 cohort, with the eight reports in psychiatry not proposing any pathway for improvement whatsoever. While observation of the figures provided may suggest more differences, some samples are very small, reducing statistical power, thus suggesting that more investigation would be necessary before drawing further conclusions from these differences. Correlations suggest that lower performance is associated with more relevant and more specific comments (only with the PSC score), as well as more suggestions for improvement (and vice versa). Good performance is associated with positive comments, and poorer performance with negative comments. However, performance is not associated with the proportion of comments (i.e. the CanMEDS roles coverage).

Descriptive statistics contrasting groups based on their difficulties (in Table 4) reveal that comments are more specific when residents actually experience difficulties, and they are more relevant if the residents experience only some isolated difficulties (the “progress concern” group). Even if the sample size is too small to produce inferential statistics to compare cohorts, these preliminary results remain encouraging. Relevance seems similar among cohorts, while the increase of specificity for residents with some progress concern is even more distinguishable. Also, pathways for improvement appear to be proposed more often in the 2016-2018 cohort. Such an observation could be congruent with the new report’s aim, i.e. to support teachers’ feedback by providing explicit expectations and observable behaviours. This could help them target precise strengths and difficulties in their student’s performance. The phenomenon should definitely be further investigated in future research with larger samples of students with and without difficulties.

Some elements might have constituted limitations in the current study. Firstly, as the three criteria appear to be distinct from one another, it might be advisable to separate them into a few indicators, as they are constructs, not variables or objective realities (Nunnally and Bernstein, 1994). This would improve reliability by broadening the raters' understanding of those criteria, hence boosting inter-rater reliability, and also adding to the internal consistency of the scale. This nuance might seem technical but it is very important for an apposite interpretation of the R, S, and PI scores. For example, "relevance" is a rather interpretable term, even with the definition we provided. On the one hand, some indicators could be added, such as "relates explicitly to aspects of the CanMEDS definition;" "is based on facts and avoids generalities;" or "is based on remediable behaviors." On the other hand, inter-rater reliability is very strong, suggesting that the criteria were understood across the board.

A second limitation is that the independence of observations could not be respected with our data, as clinical teachers and residents may be involved in one report or more. This can have the effect of artificially inflating the correlations or other analyses based on covariance. The third limitation relates to the feedback quality raters. Both cohort feedback quality criteria were coded by CS and LDB, but LC only coded the 2016-2018 reports. This resulted in a slight divergence in final score attribution and some limitations in the 2013-2015 triangulation of data. Fourth, the descriptive statistics about students in difficulty might suggest promising hypotheses, but the samples are too small to yield clear results or conclusions on the subject. In light of these limitations, the study could be improved in any replication, but none of the limitations constitutes a major obstacle to reaching the study objectives.

To the best of our knowledge, there was no framework adapted to written feedback accompanying periodical summative reports which could be applicable to our study. Therefore, we hope to have provided a basic explanation and breakdown of indicators to facilitate future feedback quality assessment and research on the subject. In addition to more specific indicators for each of the three criteria, this framework could also benefit from a comparison of the scores with the recipients' point of view about the constructiveness of the feedback they received as well as whether or not they perceive it to be actually constructive and beneficial for learning. With a larger sample, norms could also be established in order to convert the scores into interpretable qualifiers and an expectation cut-off. While validity based on consequences of testing is supported for the new version of the CAT, further investigation is needed to develop a complete framework for assessing written feedback in the context of summative reports, which could help clinical teachers improve the quality of their own feedback. Indeed, the changes to the assessment and its report were numerous: 1) the competencies being measured were changed; 2) the measurement is now criterion-based, compared to the previous normative measurement; 3) the tool is now computerized, while it was previously on paper; and 4) competency indicators with examples are now provided. Because all the changes occurred at the same time across the program, it is impossible to isolate the contribution of each of these to the results we observe. While some changes might have improved feedback quality, some might have hindered it, resulting in only small observed changes. This limitation is not unusual in educational feedback studies in medical education; feedback is often confused with other interventions, as different changes to practices are implemented simultaneously (Veloski et al., 2006). Also, few studies involve randomized educational trials in order to rigorously estimate the different approaches to feedback (Bing-You et al., 2017). In the end, while some differences were observed among cohorts, none of them suggest that changes to the assessment of family medicine residents had substantive consequences for feedback quality, which was expected at the very least. Ideally, we would have hoped that the new CAT improved constructiveness of written comments, but this does not appear to be the case. As a complement, it would be interesting to investigate residents' points of view about the quality of the feedback they received. Tools have been developed to look into the question (e.g. Bing-You et al., 2018). Though feedback quality was not affected from our point of view, it might be more appreciated and perceived as helpful by the residents themselves.


We believe it is essential to take into consideration feedback quality as a consequence of testing in medical education when validating the implementation of new testing methods. Nevertheless, even if clinical teachers know how to draft and provide constructive feedback, many factors can influence feedback quality and the recipient's adherence to the procedures that the feedback may suggest. The challenge for clinical teachers relates to all the engagement required to nurture the underlying dynamic process of feedback quality and educational alliance. It is essential for feedback providers and receivers to engage in a long-term relationship, where mutual respect is established, along with trust in the other’s good intent. It opens the door to free expression, both for the teacher who should feel comfortable in pointing out the learner’s difficulties, as well as for the learner in asking questions, expressing uncertainties, and communicating learning needs. In an era of computerization, there is a risk that such complex interaction might be less naturally established. In this context, continuing efforts must be made to support quality written and computerized feedback practices.

Take Home Messages

  • The literature suggests that when feedback is delivered with periodic written reports, feedback quality is determined by relevance, specificity, and the presence of pathways for improvement.
  • Feedback quality did not change with the implementation of a criterion-based, computerized, and summative assessment report.
  • Computerized feedback aimed at supporting clinical teachers' decisions with examples appears to contribute to better coverage of the various CanMEDS roles.
  • Comments appear to be more specific for learners in difficulty. When learners only experience some difficulties without being at risk of failure, feedback appears to be even more specific.

Notes On Contributors

Caroline Simard, PhD, was at the time of the study a research professional in medical education at the Department of Family Medicine and Emergency Medicine, Université Laval, Quebec City, Canada. ORCID ID:

Luc Côté, MSW, PhD (ed.) is professor at the Department of Family Medicine and Emergency Medicine, Université Laval, Quebec City, Canada.

Laurie de Bruyn, was at the time of the study a medical student, Université Laval, Quebec City, Canada.

Miriam Lacasse, MD MSc CCFP, is a family physician and associate professor at the Department of Family Medicine and Emergency Medicine, Université Laval, Quebec City, Canada. She co-chairs the Educational Leadership Chair in Health Professions Education and is evaluation director for the family medicine residency program. ORCID ID:


The authors would like to thank the scientific committee working on the CAT-related studies, including Christian Rheault, Jean-Sébastien Renaud, Isabelle Savard, and Johanne Théorêt, who also participated in the broader CAT validation study as well as in the review of this manuscript.

Figures 1-3 source: The Authors.


Archer, J. (2010) ‘State of the science in health professional education: effective feedback’, Medical Education, 44, pp. 101–108.

Baron, R. (1988) ‘Negative effects of destructive criticism: Impact on conflict, self-efficacy, and task performance’, Journal of Applied Psychology, 73(2), pp. 199–207.

Bing-You, R., Hayes, V., Varaklis, K., Trowbridge, R. et al. (2017) ‘Feedback for Learners in Medical Education: What Is Known? A Scoping Review’, Academic Medicine, 92, pp. 1346–1354.

Bing-You, R., Varaklis, K., Hayes, V., Trowbridge, R., et al. (2018) ‘The feedback tango: An integrative review and analysis of the content of the teacher-learner feedback exchange’, Acad Med, 93, pp. 657663.

Boehler, M., Rogers, D., Schwind, C., Mayforth, R. et al. (2006) ‘An investigation of medical student reactions to feedback: A randomised controlled trial’, Medical Education, 40(8), pp. 746749.

Bowen, L., Marshall, M. and Murdoch-Eaton, D. (2017) ‘Medical Student Perceptions of Feedback and Feedback Behaviors Within the Context of the "Educational Alliance"’, Academic Medicine, 92, pp. 13031312.

Brehaut, J., Colquhoun, H., Eva, K., Carroll, K., et al. (2016) ‘Practice Feedback Interventions: 15 Suggestions for Optimizing Effectiveness’, Annals of Internal Medicine, 164, pp. 435441.

Brinko, K. (1993) ‘The Practice of Giving Feedback to Improve Teaching: What Is Effective?’, The Journal of Higher Education, 64(5), pp. 574593.

Burgess, A. and Mellis, C. (2015) ‘Engaging Medical Students in the Basic Science Years with Clinical Teaching’, Journal of Medical Education and Curricular Development, 2, pp. 4951.

Canavan, C., Holtman, M., Richmond, M. and Katsufrakis, P. (2010) ‘The quality of written comments on professional behaviors in a developmental multisource feedback program’, Academic Medicine, 85(10), pp. S106S109.

Chur-Hansen, A. and McLean, S. (2006) ‘On being a supervisor: the importance of feedback and how to give it’, Teaching, Training and Further Education, 14(3), pp. 6771.

Ende, J. (1983) Feedback in clinical medical education’, JAMA, 250, pp. 777781.

Ericsson, K., Charness, N., Feltovich, P. and Hoffman, R. (2006) The Cambridge handbook of expertise and expert performance. New York: Cambridge University Press.

Frank, J., Snell, L. and Sherbino, J. (2015) CanMEDS 2015 Physician Competency Framework. Royal College of Physicians and Surgeons of Canada, Ottawa.

Ghazal, L., Gul, R. and Khowaja, A. (2015) ‘A step by step methodological approach to appraise the quality of written feedback using comment analysis’, British Journal of Education, Society and Behavioural Science, 10(4), pp. 110.

Hattie, J. and Timperley, H. (2007) ‘The power of feedback’, Review of Educational Research, 77, pp. 81112.

Jongekrijg, T. and Russell, J. (1999) ‘Alternative Techniques for Providing Feedback to Students and Trainees: A Literature Review with Guidelines’, Educational Technology, 39(6), pp. 5458.

Lacasse, M., Rheault, C., Tremblay, I., Renaud, J. S., et al. (2017) ‘Développement, validation et implantation dun outil novateur critérié dévaluation de la progression des compétences des résidents en médecine familiale’, Pédagogie médicale, 18(2), pp. 83100.

LeBaron, S. and Jernick, J. (2000) ‘Evaluation as a dynamic process’, Family Medicine, 32, pp. 1314.

Lefroy, J., Watling, C., Teunissen, P. and Brand, P. (2015) ‘Guidelines: the dos, donts and dont knows of feedback for clinical education’, Perspectives in Medical Education, 4, pp. 284299.

Liu, H. (2015) Comparing Welchs ANOVA, a Kruskal-Wallis test and traditional ANOVA in case of heterogeneity of variance. Master of Science thesis, Virginia Commonwealth University.

Mansouri, M. and Lockyer, J. (2007) ‘A Meta-Analysis of Continuing Medical Education Effectiveness’, Journal of Continuing Education in the Health Professions, 27, pp. 615.

Molloy, E., Ajjawi, R., Bearman, M., Noble, C., et al. (2018) ‘Challenging feedback myths, Values, learner involvement and promoting effects beyond the immediate task’, Medical Education, 00, pp. 17.

Norcini, J. and Burch, V. (2007) ‘Workplace-based assessment as an educational tool: AMEE Guide No. 31’, Medical Teacher, 29, 855871.

Nunnally, J. and Bernstein, I. (1994) Psychometric theory. 3rd edn. New-York: McGraw-Hill Inc.

Ossenberg, C., Henderson, A. and Mitchell, M. (2019) ‘What attributes guide best practice for effective feedback? A scoping review’, Advances in Health Sciences Education, 24, pp. 383401.

Parkes, J., Abercrombie, S. and McCarty, T. (2013) ‘Feedback sandwiches affect perceptions but not performance’, Advances in Health Sciences Education, 18, 397407.

Price, M., Handley, K. and Millar, J. (2011) ‘Feedback: focusing attention on engagement’, Studies in Higher Education, 36(8), pp. 879896.

Saraf, S., Bayya, J., Weedon, J., Minkoff, H. et al. (2014) ‘The relationship of praise/criticism to learning during obstetrical simulation: A randomized clinical trial’, Journal of Perinatal Medicine, 42(4), pp. 479486.

Sunderji, N. and Waddell, A. (2018) ‘Mixed methods convergent study designs in health professions education research: Toward meaningful integration of qualitative and quantitative data’, Academic Medicine, 93, pp. 1093.

Telio, S., Ajjawi, R. and Regehr, G. (2015) ‘The "educational alliance" as a framework for reconceptualizing feedback in medical education’, Academic Medicine, 90(5), pp. 609614.

Tomarken, A. and Serlin, R. (1986) ‘Comparison of ANOVA alternatives under variance heterogeneity and specific noncentrality structures’, Psychological Bulletin, 99, pp. 9099.

van de Ridder, J., McGaghie, W., Stokking, K. and Cate, O. T. J. et al. (2015) ‘Variables that affect the process and outcome of feedback, relevant for medical training: a meta-review’, Medical Education, 49, pp. 658673.

van den Boom, G., Paas, F., van Merrienboer, J. and van Gog, T. (2004) ‘Reflection prompts and tutor feedback in a web-based learning environment: effects on students self-regulated learning competence’, Computers in Human Behaviour, 20, pp. 551567.

Veloski, J., Boex, J., Grasberger, M., Evans, A., et al. (2006) ‘Systematic review of the literature on assessment, feedback and physicians clinical performance?: BEME Guide No. 7’, Medical Teacher, 28(2), pp. 117128.

Watling, C. and Ginsburg, S. (2019) ‘Assessment, feedback and the alchemy of learning’, Medical Education, 53, pp. 7685.




There are no conflicts of interest.
This has been published under Creative Commons "CC BY-SA 4.0" (

Ethics Statement

The research was exempted from the ethics approval process by the Université Laval ethics committee on February 24th 2015 (file # 2015-028). Reason for exemption: Article 2.5 from the Three Council Policy Statement (, stating: "Quality assurance and quality improvement studies, program evaluation activities, and performance reviews, or testing within normal educational requirements when used exclusively for assessment, management or improvement purposes, do not constitute research for the purposes of this Policy, and do not fall within the scope of REB review." Nevertheless, our study is in compliance with the Declaration of Helsinki.

External Funding

This study was granted funding by the Fonds Gilles-Cormier of the Université Laval Faculty of Medicine (no funding reference number) and received a Medical Council of Canada Research in Clinical Assessment grant (funding reference number: MCC-5/1718).


Please Login or Register an Account before submitting a Review

David Bruce - (26/01/2021) Panel Member Icon
I think the authors are to be congratulated for this thorough and detailed paper on how the introduction their new assessment tool affects the quality of feedback that the residents on their family medicine programme receive. The Competency Assessment Tool (CAT) was developed from the CanMEDS framework and was an on-line document that guided the teachers to provide feedback for each of the CanMEDS competencies. While the quality of feedback was not improved with the introduction of this tool, feedback comments covered a wider range of competencies.

The authors set out their ideas very clearly in the introduction stating the need to consider the consequential validity of an assessment tool. This is one part of the utility of an assessment tool and it’s good to see it being considered in this detail. I thought the review of feedback provided was excellent. As a minor point I wondered if the sentence - learner's perception of his or her trainee's authenticity – should read the teachers perception? The methodology for scoring the feedback quality was outlined and again demonstrated the thoroughness of the authors approach. Some more detail about how feedback was provided in the pre-CAT group would have been helpful – I was not clear if this was also a structured approach or just free text comments. The context of the feedback quality which correlated the three feedback criteria with the assessment of the valence of the comments and the performance score was difficult to follow as a non-expert reader. The results are clearly set out in full detail with good use of graphs and tables. Finally, the authors consider the results and (in my view anyway) perhaps understate the importance of their findings. The new tool facilitates comments in each of the CanMEDS competencies while not decreasing any of the feedback quality the residents receive. I did wonder if providing an on-line tool that guided the teachers into making comments that cover the all the CanMEDS competencies might have resulted in less specific, relevant comments and less suggestions for improvement – so I would see the results in a more positive light. Also, of note is the finding that feedback quality remains high in the key groups of residents who are struggling and not progressing as their teachers would like.

An excellent paper. I wondered if a summary version for non-expert readers would be helpful to increase the impact of the authors message.

Alice Fornari - (21/01/2021)
Thank you for addressing in your research an ongoing struggle in medical education, quality feedback. connection to a conceptual framework like CanMeds is valuable.