Research article
Open Access

Composite Reliability of Workplace Based Assessment of International Medical Graduates

Balakrishnan Nair[1], Joyce M. W. Moonen – van Loon[2], Mulavana Parvathy[3], Cees P. M. van der Vleuten[2]

Institution: 1. University of Newcastle, Centre for Medical Professional Development, 2. Faculty of Health, Medicine and Life Sciences, Maastricht University, 3. Centre for Medical Professional Development, HNE Health, Newcastle
Corresponding Author: Professor Balakrishnan Nair
Categories: Assessment, Professionalism/Ethics, Learning Outcomes/Competency
Published Date: 30/04/2021



All developed countries depend on International Medical Graduates (IMGs) to complement their workforce. However, assessing their fitness to practise and their acculturation into the new system can be challenging. To improve this, we introduced Workplace Based Assessment (WBA), using a programmatic philosophy. This paper reports the reliability of this new approach.


Over the past 10 years, we have assessed over 250 IMGs, each cohort assessed over a 6-month period. We used Mini-Cex, Case Based Discussions (CBD) and Multi-Source Feedback (MSF) to assess them. We analysed the reliability of each tool and the composite reliability of 12 Mini-Cex, 5 CBDs and 12 MSF assessments in the tool kit.


A reliability coefficient of 0.78 with a SEM of 0.19 was obtained for the sample of 236 IMGs. We found the MSF to be the most reliable tool. By adding one more MSF to the assessment on two occasions, we can reach a reliability of 0.8 and SEM of 0.18.


The current assessment methodology has acceptable reliability. By increasing the number of MSF assessments, we can improve the reliability. The lessons from this study are generalisable to IMG assessment and other medical education programs.

Keywords: Composite reliability; reliability; programmatic assessment; International medical graduates; performance assessment; Workplace Based Assessment


International medical graduates (IMGs) make up as much as 30% of the workforce in countries such as Australia, the United States, the U.K. and Canada (Patel et al., 2018). All these countries have multicultural populations, and hence the contribution of IMGs to health care provision is culturally appropriate (Pinsky, 2017). However, their journey in the health care system is often challenging (House of Representatives Standing Committee on Health and Ageing, 2012). IMGs often end up working in remote and unpopular locations and specialties. In spite of this, their scientific and academic contributions are significant: in the USA, approximately 18% of scientific publications are from IMGs and 18.3% of professors are IMGs (Khullar et al., 2017). They give excellent care to their patients, in spite of concerns from some quarters. In a study of the outcomes of 244,153 hospitalisations for congestive heart failure and acute myocardial infarction in Pennsylvania treated by IMGs, there was no mortality difference compared with U.S. graduates (Norcini et al., 2010). Nonetheless, the rates of disciplinary action against IMGs are higher than those for local graduates (Alam et al., 2017). There could be many reasons for this. Poor communication, lack of cultural awareness and issues with patient-centred care have been postulated as some of them. Other causes could be the economic pressures of resettlement in the new country, lack of orientation to the new health system, lack of mentorship, and performance assessments (Hyder, 2017).

Because of the complexity of these issues, there have been suggestions to change the assessment of IMGs before they qualify to practise in the new country. For example, the International English Language Testing System (IELTS) tests language proficiency and uses role players as one of its assessment tools. This may not be sufficient to test the linguistic skills of IMGs (Tiffin et al., 2014).

In recent years, Workplace Based Assessment (WBA) has become popular in medical education (Norcini and Burch, 2007). The practice of medicine is complex, and medical knowledge alone is not sufficient to practise. Often, what the doctor does is more important to the individual patient and to society than what the doctor knows. Because of this, there has been more interest in performance-based assessment in recent times (Whelan et al., 2002). Many undergraduate programs have introduced programmatic assessment using WBAs to assess the performance of the learner over a period of time. In programmatic assessment, low-stakes assessments are used in conjunction with immediate feedback and are aggregated to inform a summative decision. This approach can detect issues in areas that are traditionally difficult to assess, such as communication skills, teamwork and professionalism (Wilkinson et al., 2011). Most postgraduate training programs are also introducing WBA with authentic assessment tools. The advantage of programmatic assessment is regular assessment by multiple assessors over a period of time, with very frequent constructive feedback (Chan and Sherbino, 2015). This is assessment for learning rather than assessment of learning.

To remediate some of these issues, we developed a WBA program for IMGs working in our hospitals. The traditional pathway for IMGs in Australia is IELTS, followed by an MCQ examination and an OSCE examination (Australian Medical Council, 2018). We offered a WBA program as an alternative and better option for IMG assessment. The Australian Medical Council accredited this program. We used Mini-Cex, Case Based Discussions (CBD) and Multisource feedback (MSF) as the main tools for assessment (Nair et al., 2012). This was a 6-month longitudinal performance assessment. We found this format was acceptable to the IMGs and assessors (Nair et al., 2015). Moreover, this assessment is cost effective and a good investment in the long term (Nair et al., 2014).

We have assessed over 250 IMGs over the past 10 years. While we know the reliability of individual assessment tools used in the WBA (Castanelli et al., 2019), we need to know the composite reliability of these tools when used in a tool kit (Nair et al., 2017). This paper is an extended analysis with a larger sample size. We believe that the lessons learnt can be used in other settings, both in undergraduate and post graduate assessments.


IMGs who have passed the IELTS and MCQ examination have to wait to get into the OSCE clinical examination. Some of them are employed on provisional registration to work as junior doctors in hospitals where there is a doctor shortage. After obtaining accreditation from the AMC, we set up a program in 2010 to evaluate their performance as an alternative to the 3-hour OSCE examination. The candidates attended a session in which they were oriented to and trained in the 3 assessment tools (Mini-Cex, CBD and MSF).

These assessment tools are well known and validated. The Mini-Cex was developed to test the clinical performance of the trainee. It is typically done in under 30 minutes, including time for immediate constructive feedback from the assessor (Garibaldi et al., 2002). The CBDs test clinical reasoning and record keeping. The candidates select a patient whom they have looked after, and the assessors spend less than 30 minutes on assessment and feedback (Norcini and Burch, 2007). Multisource feedback has been used in management for a long time and is becoming popular in performance-based assessment. The candidates nominate colleagues, both medical and nonmedical, and these assessors usually send in the evaluation. It has been reported to be a valid and reliable tool (Miller and Archer, 2010).

Our assessment period was 6 months; the IMGs had to do 12 Mini-Cex assessments in medicine, surgery, women’s health, paediatrics, mental health and emergency medicine, 2 in each discipline. They were blueprinted to cover all domains including physical examination, history taking, counselling and prescribing. The candidates had to pass 8 cases, with at least one pass in each discipline, in order to pass the education program.

In the initial period, we used 7 CBDs and requested 12 MSF assessments on 2 occasions. We realised this number was difficult to obtain and reduced the requirements to 5 CBDs and 6 MSF assessments on 2 occasions. The IMGs had to do 5 CBDs on patients they had managed, so that their record-keeping and clinical reasoning could be assessed. They had to pass 4 out of 5 CBDs. Where possible, we used different assessors for the CBD and Mini-Cex assessments.

At month one, they had to nominate 6 colleagues who had sufficient knowledge about their performance (3 medical and 3 nonmedical) for MSF. We had stipulated that the medical colleagues should be senior clinicians and the nonmedical colleagues should be nurses and allied health professionals. Once they nominated assessors, the rating forms were sent out from the central office to make the rating confidential and anonymous. The candidates were given the de-identified rating scores and a multidisciplinary team gave them constructive feedback. Remediation was offered to candidates if needed, including one to one communication skills training. At month 6, the candidates nominated another 6 different colleagues.

An executive committee, including clinicians and educators, oversaw the program and decided on pass/fail outcomes. If there were any procedural issues or an appeal from a candidate, the assessment was reviewed by the Director. Candidates were given a second chance at the assessment on fewer than 10 occasions.

We trained over 170 clinicians on WBA and the assessment tools. They all attended a 3-hour calibration session before they were eligible to assess. At this session, they were given the rationale for this assessment and shown videos of the assessment scenarios. They independently marked each scenario, followed by feedback from experts in an interactive session. The emphasis of the training session was on multiple assessments by multiple assessors and immediate constructive feedback. The executive committee was able to review the assessments. We also held periodic feedback and upskilling sessions for the assessors.

All assessments were scheduled by the administrator, with at least 2 weeks’ notice. All candidates knew the blueprint and the schedule of assessments. They attended a 3-hour orientation session before the program.


All candidates and assessors gave consent to evaluate the data. The research was approved by the Health Services Research and Ethics committee (approval number A.U.- 201607-03) of the Health Service.


Data analysis

We collected all completed MSF, Mini-Cex and CBD assessments. The Mini-Cex and CBD assessment forms contain 7 questions, each rated on a 9-point scale (1-9). The MSF rating sheet contains 23 questions on a 5-point scale (1-5). The MSF assessment forms differ for medical and nonmedical colleagues, since different assessors are likely to see different behaviour of the candidates. Therefore, they are treated as separate measures of performance and thus as different types of assessment. All these forms were validated in previous studies. To assure homogeneity among the assessments in the portfolio, we transformed the 5-point scale used in the MSF to a 9-point scale in the dataset: each answer is multiplied by 2 and then reduced by 1, so that scores “1”, “3” and “5” on the 5-point scale become “1”, “5” and “9” on the 9-point scale. For every assessment, the average score of all answered questions is determined and used in the calculations. Empty assessments were omitted from the analysis. Candidates whose number of CBDs and/or Mini-Cex did not satisfy the quantitative requirements of 5 and 12, respectively, were excluded from the dataset, as were IMGs who had fewer than 10 MSF assessments in total. The data of 236 IMGs are included in the dataset for analysis.
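The data preparation described above can be illustrated with a minimal Python sketch. It is not the authors' code (their analysis was done in R); the function names are hypothetical, and the inclusion rule assumes that "satisfying the requirements" means having at least 5 CBDs, at least 12 Mini-Cex and at least 10 MSF assessments.

```python
def rescale_5_to_9(answer):
    """Map a 5-point MSF answer onto the 9-point scale (answer * 2 - 1)."""
    if not 1 <= answer <= 5:
        raise ValueError("expected an answer between 1 and 5")
    return 2 * answer - 1


def assessment_score(answers):
    """Average of the answered questions on one form.

    None marks an unanswered question; an empty form yields None so that
    it can be dropped from the analysis.
    """
    answered = [a for a in answers if a is not None]
    if not answered:
        return None
    return sum(answered) / len(answered)


def is_included(n_cbd, n_minicex, n_msf_total):
    """Assumed inclusion rule: enough CBD, Mini-Cex and MSF assessments."""
    return n_cbd >= 5 and n_minicex >= 12 and n_msf_total >= 10


# One MSF form with an unanswered question, rescaled and then averaged.
msf_form = [4, None, 5, 3]
rescaled = [None if a is None else rescale_5_to_9(a) for a in msf_form]
score = assessment_score(rescaled)
```

Rescaling before averaging keeps every form on the same 1-9 range, so the per-assessment means of the different tools can be combined later.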


For the generalizability study, which was performed entirely in R (R Core Team, 2019), we used a nested design in which the unique assessment (i) is nested within the facet of candidates (p), i:p. Assessors are not a facet in this design, as the set of assessors is very large and possibly unique per candidate because of the characteristics of the MSF. We determined the reliability and Standard Error of Measurement (SEM) for each of the three assessment types, using the number of assessments required by the program. We also performed a D-study for each assessment type with a varying number of assessments per type.
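A minimal Python sketch of a G-study and D-study for an i:p design follows. It is an illustration under stated assumptions, not the authors' R implementation: it assumes a balanced design (the same number of assessments for every candidate) and estimates the variance components from one-way ANOVA mean squares. All function names are hypothetical.

```python
import math


def g_study_nested(scores):
    """Estimate variance components for a balanced i:p design.

    scores: one list per candidate, each holding the same number of
    assessment scores. Returns (var_p, var_ip): the candidate variance
    and the within-candidate (assessment nested in candidate) variance.
    """
    n_p = len(scores)
    n_i = len(scores[0])
    if any(len(row) != n_i for row in scores):
        raise ValueError("this sketch assumes a balanced design")
    cand_means = [sum(row) / n_i for row in scores]
    grand = sum(cand_means) / n_p
    ms_p = n_i * sum((m - grand) ** 2 for m in cand_means) / (n_p - 1)
    ss_within = sum((x - m) ** 2
                    for row, m in zip(scores, cand_means) for x in row)
    ms_ip = ss_within / (n_p * (n_i - 1))
    var_p = max((ms_p - ms_ip) / n_i, 0.0)  # negative estimates set to 0
    return var_p, ms_ip


def d_study(var_p, var_ip, n):
    """Reliability and SEM when each candidate has n assessments."""
    error = var_ip / n
    return var_p / (var_p + error), math.sqrt(error)


def min_assessments(var_p, var_ip, target=0.80, limit=1000):
    """Smallest n whose reliability reaches the target, or None."""
    for n in range(1, limit + 1):
        if d_study(var_p, var_ip, n)[0] >= target:
            return n
    return None


# Toy data (hypothetical): 3 candidates with 2 assessments each.
var_p, var_ip = g_study_nested([[5, 7], [6, 8], [1, 3]])
reliability, sem = d_study(var_p, var_ip, 2)
```

In a fully nested design the relative and absolute error variances coincide, so a single coefficient serves as the reliability estimate. Unbalanced data, as would be expected in a real portfolio, need a more general estimator than this sketch.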

Next, we analysed the composite reliability of the three types combined in a portfolio, using multivariate generalisability theory (Brennan, 2001) and a technique similar to that described in Moonen – van Loon et al. (2013). Here, we again used the number of assessments required by the program, as available in the dataset. To determine the composite universe score and composite error score that are needed for the reliability, the variances, covariances and absolute error scores of (combinations of) the available assessment types are combined using so-called weights, in which each type is assigned a percentage of impact on the composite reliability. Choosing a set of predetermined weights for calculating the composite reliability gives assessors an indication of the importance of each assessment type within the complete set of available assessments when deciding on the performance of the candidates.
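As a sketch of this weighting step, assuming the standard multivariate generalisability formulation, the composite quantities can be written as below. The symbols are introduced here for illustration and are not taken from the paper.

```latex
\sigma^2_C(\tau) = \sum_{v}\sum_{v'} w_v \, w_{v'} \, \sigma_{vv'}(p), \qquad
\sigma^2_C(\Delta) = \sum_{v} w_v^2 \, \frac{\sigma^2_v(i{:}p)}{n_v}, \qquad
\Phi_C = \frac{\sigma^2_C(\tau)}{\sigma^2_C(\tau) + \sigma^2_C(\Delta)}, \qquad
\mathrm{SEM}_C = \sqrt{\sigma^2_C(\Delta)}
```

Here $w_v$ is the weight of assessment type $v$ (the weights sum to one), $\sigma_{vv'}(p)$ is the candidate variance of type $v$ when $v = v'$ and the covariance between the candidate scores of two types otherwise, $\sigma^2_v(i{:}p)$ is the within-candidate variance of a single assessment of type $v$, and $n_v$ is the number of assessments of that type. The error term has no covariance part on the assumption that every assessment belongs to exactly one type, so that errors are uncorrelated across types.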

Finally, we varied the number of assessments per type to determine how many are needed to obtain a reliability ≥ 0.80 and SEM ≤ 0.26, which are widely used acceptable thresholds for reliability (Crossley et al., 2002), and thus to see which changes in the assessment program could be made to reliably make high-stakes decisions on the performance of candidates.
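The search over the number of assessments can be sketched in Python as follows. This is an illustration rather than the authors' R code: it assumes the usual composite formulation (weighted candidate variances and covariances over weighted error variances), the function names are hypothetical, and the variance components in the example are invented toy values, not results from the study.

```python
import math


def composite_reliability(weights, universe_cov, error_var, n):
    """Return (reliability, SEM) of a weighted composite.

    weights:      weight per assessment type, summing to one
    universe_cov: candidate variance/covariance matrix across types
    error_var:    within-candidate variance of one assessment, per type
    n:            number of assessments per type
    """
    k = len(weights)
    universe = sum(weights[a] * weights[b] * universe_cov[a][b]
                   for a in range(k) for b in range(k))
    error = sum(weights[a] ** 2 * error_var[a] / n[a] for a in range(k))
    return universe / (universe + error), math.sqrt(error)


def smallest_n(weights, universe_cov, error_var, n, vary,
               target=0.80, limit=100):
    """Raise the counts of the types in `vary` together until the
    reliability target is met; return (counts, reliability, SEM)."""
    counts = list(n)
    for _ in range(limit):
        rel, sem = composite_reliability(weights, universe_cov,
                                         error_var, counts)
        if rel >= target:
            return counts, rel, sem
        for a in vary:
            counts[a] += 1
    return None


# Toy example with two types (all numbers hypothetical).
w = [0.5, 0.5]
cov = [[1.0, 0.5], [0.5, 1.0]]
err = [4.0, 4.0]
base_rel, base_sem = composite_reliability(w, cov, err, [4, 4])
result = smallest_n(w, cov, err, [4, 4], vary=[0, 1])
```

The search only checks the reliability threshold. The SEM is returned alongside it so that a second threshold, such as SEM ≤ 0.26, can be checked on the result.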



The cleaned dataset consists of 7472 assessments of 236 candidates. Table 1 presents the statistics on the dataset, including the number of assessments of the 236 candidates available as well as the average and standard deviation on scores, and the required and average number of assessments per candidate.

Table 1: Descriptive statistics

Columns: Case Based Discussion (CBD); Multi Source Feedback (MSF), medical colleague; Multi Source Feedback (MSF), non-medical colleague.
Rows: number of assessments; number of candidates; average score (1-9 scale); standard deviation of the score (1-9 scale); number of assessments as required by the program; average number of assessments.

Performing the reliability analysis on each type of assessment using the required number of assessments, i.e. 5 Case Based Discussions (CBD), 6 Multisource Feedback (MSF) by medical colleagues and 6 by nonmedical colleagues and 12 Mini-Cex, we obtained the results presented in Table 2. The reliability coefficient is below 0.8 for each individual assessment type. We can conclude that the MSF is the most reliable tool in the portfolio. However, no type of assessment on its own leads to a reliable result with the currently used number of assessments.

Table 2: Reliability of each assessment type using the number of assessments as required by the program

Columns: MSF medical colleague; MSF non-medical colleague.
Rows: reliability coefficient; standard error of measurement (SEM).

Figure 1 shows the reliability coefficient for varying numbers of assessments. To obtain reliable results for each assessment type individually, the program would have to do at least 30 CBD, 43 Mini-Cex and 16 MSF assessments per candidate over 2 rounds. For a 6-month program, this is not feasible for candidates and assessors.

Figure 1: Reliability Coefficient (y-axis) for varying number (nr; x-axis) of Case Based Discussions (CBD), Mini-Cex (MINICEX), medical (COLLEAGUE) and nonmedical (COWORKER) Multisource Feedback (MSF) assessments



However, in our program, results are taken together to contribute to the final pass/fail decision. Therefore, we calculated the composite reliability of the four types (CBD, Mini-Cex, and MSF from medical and nonmedical colleagues). The calculation of the composite reliability and SEM requires a number of assessments per type and a weight assigned to each type, where the weights sum to one and correspond to the importance or impact of the assessment type in the complete assessment program. As above, we used the number of assessments required by the program (5 CBD, 6 MSF per type and 12 Mini-Cex). We set the impact (or weight) per type to 13% for CBD, 20% for Mini-Cex and 67% for MSF, the latter equally split between the medical and nonmedical assessors. With the assessments in the dataset that follow the program requirements, this leads to a composite reliability of 0.78 with a SEM of 0.19.
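The weights and required numbers stated above can be laid out as a small Python sketch. The dictionary keys and the helper name are hypothetical; the figures are those given in the text, with the 67% MSF weight split equally between the two MSF types.

```python
# Weight per assessment type (MSF weight of 0.67 split equally).
weights = {
    "CBD": 0.13,
    "Mini-Cex": 0.20,
    "MSF medical": 0.67 / 2,
    "MSF nonmedical": 0.67 / 2,
}

# Number of assessments required by the program, per type.
required = {
    "CBD": 5,
    "Mini-Cex": 12,
    "MSF medical": 6,
    "MSF nonmedical": 6,
}


def with_msf_per_type(counts, n_msf):
    """Copy of the requirements with n_msf assessments per MSF type."""
    updated = dict(counts)
    updated["MSF medical"] = n_msf
    updated["MSF nonmedical"] = n_msf
    return updated


weights_sum_to_one = abs(sum(weights.values()) - 1.0) < 1e-9
total_current = sum(required.values())                         # 29
total_with_7 = sum(with_msf_per_type(required, 7).values())    # 31
```

The current requirements add up to 29 assessments per candidate, and 7 MSF assessments per type would bring the total to 31.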

In the dataset, candidates selected on average around 6.75 assessors of each type for the MSF. If we were to change the required number of assessors from 6 per type in total (3 in the first month, 3 in the 6th month) to 7, then with the same weighting of the assessment tools we would obtain a reliable composite result of 0.80 with a SEM of 0.18.


Based on our data over the last 10 years, we believe that 12 Mini-Cex and 5 CBD assessments combined with 6 MSF per round can provide an assessment program with satisfactory reliability for IMGs. As in any assessment program, the reliability should be balanced against acceptability, cost, educational impact and validity (van der Vleuten and Schuwirth, 2005; Castanelli et al., 2019). From our previous qualitative study, the acceptability is high from the learner perspective (Nair et al., 2015). They valued the immediate constructive feedback. The formative assessment program was an “educational journey” for them. They appreciated the opportunity to get to know the system and get acculturated. Faculty reported less pressure, since this was a longitudinal assessment and they were part of a team of assessors. For the six-month program, the opportunity cost was 15,000 Australian dollars (Nair et al., 2015). This was acceptable to the health service, which saw the program as a long-term investment to produce safe and competent doctors in areas where they were needed.

Any assessment of medical performance should use different tools, since practising medicine is a complex activity and no single instrument will fit the purpose. Hence any assessment should use multiple tools to provide breadth and multiple observers to reduce bias.

To obtain reliable results for each assessment type individually, the program would have to include at least 30 CBD, 43 Mini-Cex and 16 MSF assessments from medical and non-medical colleagues. For a 6-month program, this is not feasible for candidates and assessors. However, when these assessments are combined, we can get a reliable assessment with 31 assessments in total (12 Mini-Cex, 5 CBD and 14 MSF). Moreover, when the assessment is spread out over 6 months using different assessors, assessment fatigue is minimised, which makes it more acceptable to busy faculty. In fact, candidates collected on average more MSF assessments than required, from both medical and nonmedical colleagues, indicating that a very small increase in the required number of assessments seems feasible, leading to a reliability of 0.8.

The MSF was the most reliable tool in our study. It is not surprising since this is based on a more longitudinal observation of the trainee. This is consistent with the previous studies (Miller and Archer, 2010; Castanelli et al., 2019). 

Another strength of our study is its validity. As the trainees themselves described, this is an assessment done on real patients, by real clinicians in real hospitals and is an educational journey. They appreciated the immediate feedback and the supervisors reported progress of the candidates over the 6 months (Nair et al., 2015). So we believe our program fulfils all the requirements of a good assessment program, including reliability, acceptability, cost and educational impact (van der Vleuten and Schuwirth, 2005). 

However, this program is run by a single centre, and other centres may have a different experience. As in any new program, faculty buy-in and training were challenging. It would be good to study the reliability and acceptability at other sites running similar programs. Nevertheless, we think what we have learned can be adapted to different educational settings, including undergraduate and postgraduate programs, and should not be confined to IMG assessments.


We believe the WBA program has good composite reliability and that, by adding 2 extra MSF assessments, we can increase the reliability to 0.8. The lessons learned can be extrapolated to other assessments in both undergraduate and postgraduate medical education. This program is acceptable to the learners and assessors and is cost effective.

Take Home Messages

  • Performance assessment is more important than competency assessment
  • A programmatic assessment with different tools and multiple assessors will give good reliability
  • The assessments should be blueprinted
  • Constructive feedback in education is the key to improving performance

Notes On Contributors

Professor Balakrishnan Kichu Nair, MD, FRACP, FRCP is the Professor of Medicine and Associate Dean at the Medical School in Newcastle, Australia. He is the Director of the Continuing Medical Professional Development Unit at Hunter New England Health and is the Director of Educational Evaluation at the Health Education and Training Institute of NSW.

Dr Joyce M. W. Moonen - van Loon, PhD, is from the Department of Educational Development and Research at Maastricht University. She is an assistant professor and a member of the taskforce "Instructional design and E-learning", with a focus on the use and implementation of portfolios. She has a background in Econometrics and received a PhD in Operations Research from Maastricht University in 2009.

Dr Mulavana Parvathy, MBBS, FRCGP is the Director of the IMG Program at the Hunter New England Health at Newcastle and is a family physician.

Professor Cees van der Vleuten, PhD, has been at the Maastricht University in The Netherlands since 1982. In 1996 he was appointed Professor of Education and chair of the Department of Educational Development and Research in the Faculty of Health, Medicine and Life Sciences (until 2014). Since 2005 he has been the Scientific Director of the School of Health Professions Education (until 2020). He mentors many researchers in medical education and has supervised more than 90 doctoral graduate students. His primary expertise lies in evaluation and assessment.


We thank Kathy Ingham and Lynette Gunning for administrative support and data management, and Stephen Mears for editorial help.

Figure 1 was created by the authors using Microsoft Excel.


Alam, A., Matelski, J. J., Goldberg, H. R., Liu, J. J., et al. (2017) 'The Characteristics of International Medical Graduates Who Have Been Disciplined by Professional Regulatory Colleges in Canada: A Retrospective Cohort Study', Acad Med, 92(2), pp. 244-249.

Australian Medical Council (2018) Overview of assessment pathways. Available at: (Accessed: June 2020).

Brennan, R. L. (2001) Generalizability theory. 1st edn. Springer.

Castanelli, D. J., Moonen-van Loon, J. M. W., Jolly, B. and Weller, J. M. (2019) 'The reliability of a portfolio of workplace-based assessments in anesthesia training', Can J Anaesth, 66(2), pp. 193-200.

Chan, T. and Sherbino, J. (2015) 'The McMaster Modular Assessment Program (McMAP): A Theoretically Grounded Work-Based Assessment System for an Emergency Medicine Residency Program', Acad Med, 90(7), pp. 900-5.

Crossley, J., Davies, H., Humphris, G. and Jolly, B. (2002) 'Generalisability: a key to unlock professional assessment', Med Educ, 36(10), pp. 972-8.

Garibaldi, R. A., Subhiyah, R., Moore, M. E. and Waxman, H. (2002) 'The In-Training Examination in Internal Medicine: an analysis of resident performance over time', Ann Intern Med, 137(6), pp. 505-10.

House of Representatives Standing Committee on Health and Ageing (2012) Lost in the labyrinth: report on the inquiry into registration processes and support for overseas trained doctors. Canberra: Commonwealth of Australia.

Hyder, S. (2017) 'International Medical Graduates Who Have Been Disciplined: Further Causes and Methods to Improve Quality of Care', Acad Med, 92(12), pp. 1651-1652.

Khullar, D., Blumenthal, D. M., Olenski, A. R. and Jena, A. B. (2017) 'U.S. Immigration Policy and American Medical Research: The Scientific Contributions of Foreign Medical Graduates', Ann Intern Med, 167(8), pp. 584-586.

Miller, A. and Archer, J. (2010) 'Impact of workplace based assessment on doctors' education and performance: a systematic review', BMJ, 341, p. c5064.

Moonen-van Loon, J. M., Overeem, K., Donkers, H. H., van der Vleuten, C. P., et al. (2013) 'Composite reliability of a workplace-based assessment toolbox for postgraduate medical education', Adv Health Sci Educ Theory Pract, 18(5), pp. 1087-102.

Nair, B. K., Moonen-van Loon, J. M., Parvathy, M., Jolly, B. C., et al. (2017) 'Composite reliability of workplace-based assessment of international medical graduates', Med J Aust, 207(10), p. 453.

Nair, B. K., Parvathy, M. S., Wilson, A., Smith, J., et al. (2015) 'Workplace-based assessment; learner and assessor perspectives', Adv Med Educ Pract, 6, pp. 317-21.

Nair, B. K., Searles, A. M., Ling, R. I., Wein, J., et al. (2014) 'Workplace-based assessment for international medical graduates: at what cost?', Med J Aust, 200(1), pp. 41-4.

Nair, B. R., Hensley, M. J., Parvathy, M. S., Lloyd, D. M., et al. (2012) 'A systematic approach to workplace-based assessment for international medical graduates', Med J Aust, 196(6), pp. 399-402.

Norcini, J., Boulet, J. R., Dauphinee, W. D., Opalek, A., et al. (2010) 'Evaluating the quality of care provided by graduates of international medical schools', Health Aff (Millwood), 29(8), pp. 1461-8.

Norcini, J. and Burch, V. (2007) 'Workplace-based assessment as an educational tool: AMEE Guide No. 31', Med Teach, 29(9), pp. 855-71.

Patel, Y. M., Ly, D. P., Hicks, T. and Jena, A. B. (2018) 'Proportion of Non-US-Born and Noncitizen Health Care Professionals in the United States in 2016', JAMA, 320(21), pp. 2265-2267.

Pinsky, W. W. (2017) 'The Importance of International Medical Graduates in the United States', Ann Intern Med, 166(11), pp. 840-841.

R Core Team (2019) R: A language and environment for statistical computing. Available at: (Accessed: June 2020).

Tiffin, P. A., Illing, J., Kasim, A. S. and McLachlan, J. C. (2014) 'Annual Review of Competence Progression (ARCP) performance of doctors who passed Professional and Linguistic Assessments Board (PLAB) tests compared with UK medical graduates: national data linkage study', BMJ, 348, p. g2622.

van der Vleuten, C. P. and Schuwirth, L. W. (2005) 'Assessing professional competence: from methods to programmes', Med Educ, 39(3), pp. 309-17.

Whelan, G. P., Gary, N. E., Kostis, J., Boulet, J. R., et al. (2002) 'The changing pool of international medical graduates seeking certification training in US graduate medical education programs', JAMA, 288(9), pp. 1079-84.

Wilkinson, T. J., Tweed, M. J., Egan, T. G., Ali, A. N., et al. (2011) 'Joining the dots: conditional pass and programmatic assessment enhances recognition of problems with professionalism and factors hampering student progress', BMC Med Educ, 11, p. 29.




There are no conflicts of interest.
This has been published under Creative Commons "CC BY-SA 4.0".

Ethics Statement

All candidates and assessors gave consent to evaluate the data. The research was approved by the Health Services Research and Ethics committee of the Health Service (approval number A.U.- 201607-03).

External Funding

This article has not had any external funding.



Lorna Davin - (10/05/2021)
It is very interesting to read the nuanced outcomes of this paper addressing an important area of medical education, our diverse medical workforce, and patient care.
I especially appreciate the take-home message which emphasizes the importance of performance assessment over competency assessment. The approach taken allows the International Medical Graduates (IMGs) to be assessed within the messiness of their practice recognizing and embracing the social and cultural complexities of patient-care – which adds a depth of authenticity. The acceptance and buy-in of faculty is also significant. It is interesting to note how the longitudinal nature of the program, and the shared workload, reduced individual pressures of faculty.
The use of Multisource feedback, across the disciplines, with an emphasis on support and remediation, over time, is an obvious strength of this well-considered and managed program. The research outcomes emphasize validity and reliability. However, the program also adds a layer of transparency to the IMG assessment process, and in doing so I would argue provides the learner with a fairer and more equitable assessment, supporting this specific group of learners, and much needed part of our medical workforce, while they transition to clinical practice in an unfamiliar place and culture.
The authors are to be congratulated on this most useful contribution which adds to our understanding of how to better support learners and faculty while indirectly enhancing future patient care.
Possible Conflict of Interest:

Lorna Davin was Manager of the Centre for Medical Professional Development in the Hunter from 2004-2007.

Karen D'Souza - (07/05/2021)
The authors of this paper are leaders in the field of WBA and programmatic assessment. I was impressed by this group's work as they aimed to conduct the fairest, most reliable program of clinical assessment for IMGs whilst also considering the feasibility of delivering such a program in the clinical setting. I applaud the group for their demonstration of the statistical analysis supporting their conclusions. Of particular note is the assessment performance of the MSF method of collecting datapoints, not only from medical supervisors but also other health professionals to build the most comprehensive and holistic clinical performance review. Their program of WBA also builds in feedback at ideal time points within the 6 month period of study - i.e. one MSF early, and one towards the end of 6 months, thus providing the opportunity for early feedback to shape the training of the IMGs, and then a later checkpoint for the program directors and trainees to review whether their training goals have been achieved to a sufficient degree. Sharing the rich feedback gained from the MSF forms with the IMGs allows the full potential of these assessments to be realised. Their results should be highly generalisable to other training contexts beyond IMG assessment.
Dujeepa D. Samarasekera - (06/05/2021)
Interesting paper and enjoyed reading it. The paper addresses an important challenging area in medical education and evaluating the competencies of physicians who migrate from one practice setting to another. The authors have discussed a more wholistic evaluation of an IMG's competence and their actual practice performance. Another strong point in this study is that they have collected data from over 200 IMGs in a 10-year period to make their final conclusions. Another important highlight is how they have designed this process aligned to the principles of programmatic assessment incorporating day-to-day workplace-based assessments in a practical and useful way both to the IMG and the supervisors. I am certain that this paper will assist many of us who are planning clinical assessments to evaluate trainees' performance for safe and effective practice. Apart from some minor limitations highlighted by other reviewers above, I am happy with the work done.
Chris Roberts - (03/05/2021)
Work based assessment is an important area of medical education research and policy development. Many MedEdPublish readers who are clinicians will either have experienced it or be expected to assess students or trainees in the workplace. This paper evaluates the reliability of Workplace Based Assessment (WBA) of International Medical Graduates (IMGs), in particular establishing the composite reliability of a WBA toolset consisting of Mini-Cex, CBD and MSF. The authors have 10 years of data available (7472 assessments of 236 candidates) and give a good account of providing further validity evidence on a long-standing system outside some of the main centres in the US and UK. It assures the stability of the estimates of the reliability and standard error of measurement of the tools, used alone and in combination, in this context. The paper also has value in describing a programmatic approach to standard setting and decision making at the end of the 6-month assessment period, based on the composite reliability of multiple samples of behaviour using multiple judges. The deepening of the validity evidence is important for the many institutions internationally that offer WBAs and are keen to demonstrate robust assessment of learning to their major stakeholders in this kind of assessment.
The gold standard of > 0.8 perhaps drives us towards homogeneity in the various assessment types? Would reducing this to the more realistic r = 0.7 help in this regard, allowing a more heterogeneous sample of behaviours for the composite? I was interested to know, for future research, how the qualitative feedback received by trainees might enrich the progression decisions of the executive committee that considers the program of WBAs.
Possible Conflict of Interest:

No conflicts of interest to declare

Tarun Sen Gupta - (02/05/2021)
I enjoyed reading this paper from some of the leading proponents of work based assessment. Kichu Nair’s team have led the research agenda in WBA, underpinning the strong educational and practical aspects with a sound and detailed evaluation. Their data set is substantial - 250 trainees over 10 years - and the results are likely to be significant in many other jurisdictions, as the authors note.

The programmatic aspect of this work is particularly important, presenting a ‘composite reliability’ in addition to reporting each component, which adds further weight to the argument regarding combining multiple assessment formats. Furthermore, in the time of COVID, educators world-wide are looking at alternatives to ‘big bang’ events, so this approach presents a credible option.

I also appreciated the positive acknowledgement of the important work that IMGs do, and their very significant contribution to the medical workforce in many settings. There is also important local and international context and sufficient detail to allow others to replicate this program in their own environment.

The quoted SEM of 0.18-0.19 did look very low to me, but I note this is on a reduced scale (9 points, reduced in some analyses to 5), not a percentage. This is a detail the authors may wish to clarify in any update. The wording ‘by the program required...’ in reference to the number of assessments is also a little clumsy.

These are only minor quibbles in what is an important addition to the literature. The authors’ comments on the number of assessments required for desired reliability (Figure 1) and the observation about the utility of MSF are worth noting as important findings in themselves. I also note their description of training clinicians in the various techniques before they are ‘eligible to assess’, a step which is often omitted or done poorly.

I think this article provides some useful evidence on which to further develop our assessment programs in a post-COVID world.
Sankar Sinha - (01/05/2021)
Prof Nair and his co-authors of this article are renowned international experts on developing the Workplace-Based Assessment program for International Medical Graduates. This article is the culmination of their ten-year experience with the Mini-Cex, Case Based Discussions (CBD) and Multi-Source Feedback (MSF) assessment tools. It was indeed a pleasure to read this well-written, concise article demonstrating the value of composite reliability in the WBA program. It is heartening to read their concluding statement “The lessons learned can be extrapolated into other assessments, both in undergraduate and postgraduate medical assessments.” I have been trying to implement this strategy for the Interns at our hospital based on the evidence from Prof Nair’s experience of WBA for the IMGs.
Ian Wilson - (01/05/2021) Panel Member
This paper describes a response to a difficult problem. The assessment of International Medical Graduates (IMGs) has been problematic for many years. The use of WBA in a composite way in this paper demonstrates an effective, reliable method. The composite outcomes ensure a reliable method of assessment that is acceptable to those being assessed and produces results that enable rational decisions to be made.
The authors of this study have demonstrated a method of assessing IMGs that will produce results leading to a much more reliable and effective way of ensuring the competence of those doctors about to enter practice.