Journal of the College of Physicians and Surgeons Pakistan
ISSN: 1022-386X (PRINT)
ISSN: 1681-7168 (ONLINE)
doi: 10.29271/jcpsp.2025.06.793

ABSTRACT
The application of generative artificial intelligence (GAI) in medical education and practice has garnered increasing attention, particularly its significant potential to enhance personalised learning and clinical training. This viewpoint explores the integration of GAI into medical education, analysing its advantages in disseminating medical knowledge, simulating case scenarios, and supporting clinical decision-making. Although GAI introduces innovative opportunities to medical education, its practical application also presents various challenges, such as model accuracy and ethical concerns. The viewpoint further discusses the potential impact of these challenges on the future of medical education and offers corresponding strategies and recommendations, providing valuable insights for educators and policymakers. By understanding the practical applications and limitations of GAI in the medical field, this viewpoint aims to lay a foundation for more effective use of GAI in medical education in the future.
Key Words: Generative artificial intelligence, Large language model, Medical education.
The development of generative artificial intelligence (GAI) can be summarised as a shift from statistical models to deep learning and large-scale pre-trained models. In November 2022, OpenAI launched ChatGPT, a chatbot based on GPT-3.5, a large language model (LLM). The model uses a transformer architecture and is capable of zero-shot learning, effectively processing text data and generating coherent responses. In the following years, advances in transformer and multimodal generative models have led to explosive growth. At present, in addition to the upgrade from GPT-3.5 to the GPT-4 series, Google has launched Bard, PaLM 2, and Gemini; Meta has released the open-source model LLaMA; and Elon Musk's AI company has launched the open-source Grok. In the past decade, the authors have witnessed a series of achievements in AI, with Google's DeepMind being one of the most prominent examples. In clinical practice, deep learning reduces the workload of healthcare professionals by automatically identifying features in MRIs, CT scans, X-rays, and other medical imaging modalities. Given these immense prospects, vertical industries are also actively exploring and expanding GAI's application potential, demonstrating its extensive and far-reaching influence.
This viewpoint aims to summarise the current progress of GAI application in the field of medical education, analyse the new challenges and problems associated with these innovative applications, and make a forward-looking prediction of the future development path in medical education.
With the continuous iteration and upgrading of GAI technology, its application in medical education has become a field of great concern. GAI is deeply rooted in computational neurobiology, an interdisciplinary discipline that integrates the essence of medicine and computer science. In the field of medical education, GAI tools, such as ChatGPT, have the potential to promote self-directed learning among medical students by providing detailed and relevant information and answering students' questions in real time, such as questions about specific diseases, treatments, or procedures. GAI's key strength lies in its ability to provide flexible solutions that meet the personalised learning needs of medical students, enabling the transition from a traditional, pre-defined curriculum model to a fluid and adaptable learning framework. This shift has made it more autonomous and efficient for medical students to access the information they need.
Kung et al. used ChatGPT to take the United States Medical Licensing Examination (USMLE). The results showed that ChatGPT was able to pass the USMLE without specialised pre-training.1 ChatGPT is about 80% accurate in answering questions related to microbiology.2 Nevertheless, ChatGPT's accuracy in nephrology, especially in electrolyte and acid-base imbalances and glomerular diseases, is relatively low. The reason is that clinical problems involving electrolyte and acid-base disorders often require complex calculations, while glomerular diseases require an in-depth understanding of renal pathology, physiology, and treatment options spanning a wide range of topics such as immunology, genetics, and pharmacology.3 Safranek et al. evaluated ChatGPT's ability to identify and determine clinical causes. They found that, lacking clinical reasoning and cognitive ability, ChatGPT tends to construct and present a seemingly conclusive answer based on the weight of the data accumulated during its training, accompanied by a logical and persuasive argumentation process. However, despite the rigorous construction of the answers, the causal judgement given is wrong.4 These findings suggest that while GAI is competent for tutoring memorised knowledge, it should be used with caution in learning tasks involving deep comprehension, detailed analysis, precise calculation, or reasoning.
Chu et al. devised a new medical education tool that uses multimodal GAI to create realistic, interactive virtual scenarios and synthetic virtual patients, providing lifelike conversations and videos that simulate telemedicine with high fidelity. These synthetic patients interact with users at different stages of medical care through a customised video-chat app.5 Chheang et al. constructed an immersive virtual reality environment for human anatomy lessons, in which users could participate deeply in the experience and communicate verbally with an embodied virtual assistant based on GAI technology. Participants used the Valve Index VR headset and related components to navigate the VR system, and interacted with 3D models by grasping, resizing, and rotating them to understand their functions. An evaluation of the technology's effectiveness and usability showed that participants scored significantly higher on knowledge-based questions than on analysis-based questions.6 In general, although GAI has some shortcomings, combining it with other information technologies can produce a new medical practice education model that integrates personalisation, efficiency, immersion, and iteration.
Lehman et al. showed that a relatively small specialised clinical model can significantly outperform a general-domain model, even when fine-tuned on limited annotated data. In addition, pre-training on clinical text can create smaller, more efficient models that match or outperform general-domain models.7 In real-world medical scenarios, feeding sensitive medical information into a public model raises privacy and ethical concerns. While LLMs show great potential for automatically generating feedback, they are often very expensive to develop and maintain. These issues have prompted consideration of open-source LLMs in medical education, such as LLaVA and LLaMA, which can easily be run locally, rather than mass-market, closed-source models such as ChatGPT. The use of open-source LLMs not only reduces costs but also provides greater transparency and control, which is especially important for medical education. Adapting open-source LLMs through targeted fine-tuning and knowledge distillation can reduce the overall model size without compromising the LLM's basic functions, and enables educational institutions and researchers to better understand and adjust the model's behaviour to ensure that it meets educational and ethical standards. In addition, it provides an opportunity to develop customised medical education tools and resources that better meet the needs of specific learners. Derivative models can be used to build virtual tutors or instructor-guidance tools that provide a personalised learning experience, enhancing the interactivity and effectiveness of medical education.
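The knowledge-distillation objective mentioned above can be made concrete with a minimal sketch. The code below is an illustrative, self-contained implementation of the classic softened-softmax distillation loss (temperature-scaled KL divergence between teacher and student outputs); the function names are the authors' own illustration, not taken from any cited work, and a real training loop would apply this loss over batches inside a deep-learning framework.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-softened softmax: a higher temperature spreads
    # probability mass more evenly across classes.
    exps = [math.exp(x / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # KL(teacher || student) on the softened distributions, scaled by T^2
    # so gradients keep a comparable magnitude as temperature changes.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return temperature ** 2 * kl
```

A student model trained to minimise this loss imitates the teacher's full output distribution rather than only its top answer, which is why a much smaller model can retain most of the larger model's behaviour.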
Datasets are fundamental to realising the potential of GAI technology. These datasets should integrate the core data required for large-scale natural language processing and meet the highest standards of the medical profession. The construction of the dataset relies on a precise definition of the content of the medical discipline, covering a wide range of medical knowledge and information, and closely aligning with real-world medical cases to ensure that the LLM can handle complex real-world situations and provide practical solutions. High-quality annotation can improve the model's comprehension of medical texts, thereby improving its application in educational technology. Given the continuous advancement of medical knowledge, datasets need to be regularly updated and maintained to reflect the latest medical discoveries and practices. Due to data privacy and confidentiality issues, there is currently a lack of high-quality, public, large clinical datasets for training LLMs. Although the gold standard for privacy preservation in model training is differential privacy (DP), some research has shown that DP has negative implications for model accuracy and fairness.8 Some researchers address this challenge by leveraging synthetic data generated from biomedical and clinical literature, as well as synthetic data from question-response datasets, to train and evaluate domain-specific models.9 There is no standard answer for how to evaluate the performance of LLM-derived models in medical education reasonably and effectively: no two studies have used the same assessment method, indicating wide divergence between research teams in this regard. This divergence may lead to inconsistencies and difficulty in comparing results, which may affect the reliability and generalisability of the findings.
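The accuracy cost of DP mentioned above has a simple mechanical source, which a minimal sketch makes visible. The snippet below is an illustrative, pure-Python rendering of the core aggregation step of DP-SGD (per-example gradient clipping followed by calibrated Gaussian noise); the function name and parameter choices are the authors' own illustration, and the privacy accounting (the epsilon/delta budget) that a real system must track is deliberately omitted.

```python
import math
import random

def dp_sgd_aggregate(per_example_grads, clip_norm=1.0,
                     noise_multiplier=1.1, rng=None):
    # Clip each per-example gradient to L2 norm <= clip_norm, sum the
    # clipped gradients, add Gaussian noise calibrated to the clip norm,
    # then average. Both clipping and noise distort the true gradient,
    # which is where DP's accuracy cost comes from.
    rng = rng or random.Random(0)
    dim = len(per_example_grads[0])
    summed = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        scale = min(1.0, clip_norm / (norm + 1e-12))  # clip, never amplify
        for i, g in enumerate(grad):
            summed[i] += g * scale
    sigma = noise_multiplier * clip_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [x / len(per_example_grads) for x in noisy]
```

Because every example's influence is bounded by the clip norm and then masked by noise, no single patient record can dominate an update, which is precisely the guarantee DP provides at the price of noisier training.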
Therefore, there is a need to develop new, scalable, and accurate assessment metrics and benchmark datasets to accommodate a wide range of medical tasks, ensuring that the assessments are both meaningful and reflect real-world and cutting-edge applicability.
From an international standpoint, the EU AI Act deems medical devices and education high-risk areas; all high-risk AI systems must be evaluated before market introduction and throughout their entire lifecycle. At a micro level, with the development of advanced language models such as ChatGPT, more and more medical education institutions have begun to explore how to integrate GAI into the teaching process. However, there is a lack of conclusive evidence that search and open-ended Q&A significantly improve learning efficiency. On the one hand, students with weak self-regulated learning abilities may have difficulty asking in-depth questions and need metacognitive support to participate in learning more effectively. On the other hand, students may be distracted by content irrelevant to the course or to their long-term personal goals.10 In extreme cases, if healthcare professionals accept GAI output without fully understanding the training dataset and algorithmic logic, they may face ethical scrutiny and legal liability.11
Plagiarism and cheating are often associated with ChatGPT's use in education, reflecting widespread concerns about its possible negative impact on academic integrity. Educational institutions need to develop guidelines and policies to ensure the responsible use of AI, protect data privacy, and maintain academic integrity. In addition, in a learning environment where ChatGPT is integrated, specific teaching methods and assessment strategies need to be clarified to maintain fairness. Current research has focused on the short-term effects of ChatGPT, but over time, ChatGPT and similar tools may have a lasting impact on teaching practices and educational outcomes.12 Therefore, educators need to ensure that medical students know how to use this technology effectively and responsibly in the long term. Assessing the accuracy and quality of information sources is a crucial skill, and medical education should develop students' ability to critically evaluate medical literature, including assessing the credibility of authors, the reliability of sources, and external review.
Sejnowski argued that LLMs can effectively perform complex reverse Turing tests: the smarter the user, the deeper their prompts may be, making the LLM appear smarter; if users hold strong views, the LLM may further reinforce those views.13 Prompt engineering is based on this observation. Although the application of prompt engineering in GAI has been widely recognised, challenges remain. First, there is the problem of prompt robustness: prompts within the same frame may elicit different responses owing to minor word variations. Second, the effectiveness of prompt engineering depends on the inherent capabilities of the LLM itself; prompts that work for one model may not work for another. Therefore, prompt-engineering guidelines need to be developed for patients and physicians according to their specific requirements. To address these issues, new prompting methods, such as incremental-reasoning chain-of-thought prompting, follow the way decisions are usually made in real-life clinical settings, whilst also meeting the underlying requirements for accuracy and verifiability.14
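The incremental-reasoning idea can be sketched as a small prompt-assembly helper. The code below is an illustrative example only: the function name, wording, and clinical vignette are the authors' own invention (not from the cited work, and not tied to any particular model's API), but it shows the general pattern of forcing the model through ordered clinical steps before it commits to an answer.

```python
def build_cot_prompt(vignette, steps):
    # Assemble an incremental-reasoning (chain-of-thought) prompt that
    # walks the model through the ordered steps a clinician would follow,
    # asking it to show its work at each stage before answering.
    lines = [
        f"Clinical vignette: {vignette}",
        "Reason through the following steps in order, showing your work:",
    ]
    for i, step in enumerate(steps, start=1):
        lines.append(f"{i}. {step}")
    lines.append("Finally, state your conclusion and cite the findings "
                 "that support it.")
    return "\n".join(lines)

# Hypothetical usage with an invented vignette:
prompt = build_cot_prompt(
    "A 65-year-old presents with oliguria and a rising creatinine.",
    ["List the abnormal findings.",
     "Classify the likely cause as pre-renal, intrinsic, or post-renal.",
     "Name one confirmatory test."],
)
```

Structuring the prompt this way makes the model's intermediate reasoning visible and checkable, which directly serves the accuracy and verifiability requirements noted above.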
GAI errors are mainly caused by hallucinations and biases. Hallucinations manifest as facts constructed out of nothing and information interpreted with overconfidence. Especially when generating coherent and logically self-consistent responses, hallucinations can mislead users into accepting false conclusions, a serious failure mode in education. For example, when fine-tuning an LLM for clinical psychology, programmers may disagree on narcissistic statements, growth-mindset statements, or advice for coping with stress. The fine-tuned model may retain the inherent biases of the underlying LLM's training data, and may even become more biased, especially when the fine-tuners lack expertise or hold biases of their own.15 In conclusion, although technology offers partial solutions, bias and hallucination remain difficult to eliminate completely. Whilst there is currently no comprehensive solution, some researchers have reviewed articles published between 2010 and 2023 on the biases arising when training AI on electronic health records. They suggest that, in the future, biases can be mitigated in four ways: data quality checks, bias detection and mitigation pipelines, interpretability of biases in AI applications, and the validation of AI models.16
The biggest controversy is whether GAI is the right technical route to artificial general intelligence. LeCun argued that the current road to machine learning technology is problematic, and that AI is missing an important ingredient compared to humans and animals, although it is not clear what exactly this is.17 Fedorenko et al. echoed LeCun's view from the perspective that language and thinking are not necessary for each other.18
The development of AI relies on the in-depth study of biology, especially the exploration of frontiers of neurobiology, such as information transmission and storage in the cerebral cortex. Currently, many mysteries remain unsolved, such as the source of consciousness and the storage of long-term memories. GAI technology is rapidly evolving, and it is expected that the next generation of GAI will integrate multimodality and intelligent agents, greatly reduce bias and hallucinations, and achieve a further technological leap. In the coming years, GAI is likely to play an increasingly important role in society. Effective use of GAI requires practitioners to have some technical expertise, including a deep understanding of AI methods. The application of AI has always been ahead of its theoretical development, and the emergence of GAI has deepened interest in the puzzles of these technologies, such as the essential causes of emergent behaviour and the black-box problem of deep learning. In the future, it is expected that some of the basic theories of GAI will be explained through multidisciplinary research, in which the medical field will play an important role. Therefore, medical practitioners need to be trained and educated to improve their professionalism in this field. Providing GAI training for medical students and faculty can help them participate more fully in the development of AI. This training not only helps to improve the effectiveness of GAI in medical education, but also fosters interdisciplinary collaboration and promotes knowledge exchange and technological innovation between medicine and computer science.
FUNDING:
Jiangsu overseas visiting scholar programme for university prominent young and middle-aged teachers and presidents. Jiangsu Provincial Higher Education Reform Research Project (No. 2023JSJG338).
COMPETING INTEREST:
The authors declared no conflict of interest.
AUTHORS’ CONTRIBUTION:
SW: Literature collection, literature analysis, and manuscript drafting.
RG: Manuscript drafting.
RX: Topic design and manuscript drafting.
All authors approved the final version of the manuscript to be published.
REFERENCES