Objective: Current medical questionnaire resources are mainly processed and organized at the document level, which hampers user access and reuse at the level of individual questionnaire items. This study proposes a multi-class classification method for items in medical questionnaires in low-resource scenarios, to support the fine-grained organization and provision of medical questionnaire resources. Methods: We introduced a novel BERT-based prompt learning approach for multi-class classification of items in medical questionnaires. First, we curated a small corpus of lung cancer medical assessment items by collecting relevant clinical assessment questionnaires, extracting function and domain classifications, and manually annotating the items with "function-domain" combination labels. We then applied prompt learning: a customized template was fed into BERT, the masked positions were predicted and filled, and the filled-in text was mapped to labels, which enables multi-class classification of item texts in medical questionnaires. Results: The constructed corpus comprised 347 clinical assessment items for lung cancer across nine "function-domain" labels. The proposed method achieved an average accuracy of 93% on our self-constructed dataset, outperforming the runner-up, GAN-BERT, by approximately 6%. Conclusion: The proposed method can maintain robust performance while minimizing the cost of building medical questionnaire item corpora, demonstrating its value for research and practice in medical questionnaire classification.
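The template, mask-filling, and label-mapping steps described in the Methods can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the template wording, the label words, and the three "function-domain" labels are hypothetical (the abstract does not list them), a single mask is used for brevity, and no fine-tuning is shown. The `toy_scorer` function is an offline stand-in for a masked language model; `bert_scorer` shows how a BERT model could be plugged in through the Hugging Face `transformers` fill-mask pipeline, assuming that library is installed.

```python
from typing import Callable, Dict, List

# Hypothetical template: the abstract does not give the actual wording.
TEMPLATE = "{item} This question is about [MASK]."

# Hypothetical verbalizer mapping a label word to a "function-domain" label.
# The real corpus has nine labels; these three are invented for illustration.
VERBALIZER: Dict[str, str] = {
    "pain": "assessment-physical",
    "mood": "assessment-psychological",
    "family": "screening-social",
}

# A scorer takes a prompt and candidate label words and returns word -> score.
Scorer = Callable[[str, List[str]], Dict[str, float]]


def build_prompt(item: str) -> str:
    """Wrap a questionnaire item in the template containing the mask."""
    return TEMPLATE.format(item=item.strip())


def classify(item: str, score_mask: Scorer) -> str:
    """Score each label word at the mask and map the best word to its label."""
    scores = score_mask(build_prompt(item), list(VERBALIZER))
    best_word = max(scores, key=scores.get)
    return VERBALIZER[best_word]


def toy_scorer(prompt: str, candidates: List[str]) -> Dict[str, float]:
    """Offline stand-in for a masked LM: counts hand-picked cue words."""
    cues = {
        "pain": ["pain", "cough", "breath"],
        "mood": ["sad", "worried", "anxious"],
        "family": ["family", "friends"],
    }
    text = prompt.lower()
    return {w: float(sum(text.count(c) for c in cues.get(w, []))) for w in candidates}


def bert_scorer(model_name: str = "bert-base-uncased") -> Scorer:
    """Build a scorer backed by a BERT masked LM (downloads the model)."""
    from transformers import pipeline  # lazy import: optional dependency

    fill = pipeline("fill-mask", model=model_name)

    def score(prompt: str, candidates: List[str]) -> Dict[str, float]:
        # Restrict predictions to the label words; this assumes each label
        # word is a single token in the model's vocabulary.
        results = fill(prompt, targets=candidates, top_k=len(candidates))
        return {r["token_str"].strip(): r["score"] for r in results}

    return score


if __name__ == "__main__":
    print(classify("Do you have chest pain when you cough?", toy_scorer))
```

Keeping the scorer pluggable separates the prompt design (template and verbalizer) from the language model, so the same pipeline can be exercised offline and then run with `classify(item, bert_scorer())` once a model is available.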
Objective: To construct a demand model for electronic medical record (EMR) data quality based on the lifecycle of machine learning (ML)-based disease risk prediction, in order to guide the implementation of EMR data quality assessment. Methods: Following the lifecycle of ML-based predictive modeling, we explored the demands placed on EMR data quality. First, we summarized the key data activities involved in each task of ML-based disease risk prediction through a literature review. Second, we mapped the data activities in each task to the associated data quality requirements. Finally, we clustered those requirements into four dimensions. Results: We constructed a three-layer ring structure to represent the demand model for EMR data quality in ML-based disease risk prediction research. The inner layer shows the seven main tasks in ML-based predictive modeling: data collection, data preprocessing, feature representation, feature selection and extraction, model training, model evaluation and optimization, and model deployment. The middle layer shows the key data activities in each task, and the outer layer represents the four dimensions of data quality requirements: operability, completeness, accuracy, and timeliness. Conclusion: The proposed model can guide real-world EMR data governance, improve EMR data quality management, and promote the generation of real-world evidence.
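The three-layer structure described in the Results (tasks, data activities, quality dimensions) can be represented as a nested mapping. The sketch below is only an illustration: the seven tasks and four dimensions come from the abstract, whereas the data activities and their assignment to dimensions are hypothetical placeholders, since the abstract does not enumerate them.

```python
from typing import Dict, List, Set

# Inner layer: the seven tasks named in the abstract.
TASKS: List[str] = [
    "data collection",
    "data preprocessing",
    "feature representation",
    "feature selection and extraction",
    "model training",
    "model evaluation and optimization",
    "model deployment",
]

# Outer layer: the four data quality dimensions named in the abstract.
DIMENSIONS: Set[str] = {"operability", "completeness", "accuracy", "timeliness"}

# Middle layer: task -> data activity -> required dimensions.
# These activities and assignments are hypothetical examples only.
ACTIVITIES: Dict[str, Dict[str, Set[str]]] = {
    "data collection": {
        "extract records from the EMR system": {"operability", "timeliness"},
    },
    "data preprocessing": {
        "handle missing values": {"completeness"},
        "correct erroneous values": {"accuracy"},
    },
}


def validate(model: Dict[str, Dict[str, Set[str]]]) -> None:
    """Check that every task and dimension in the mapping is a known one."""
    for task, activities in model.items():
        if task not in TASKS:
            raise ValueError(f"unknown task: {task}")
        for activity, dims in activities.items():
            unknown = dims - DIMENSIONS
            if unknown:
                raise ValueError(f"{activity}: unknown dimensions {unknown}")


def dimensions_for_task(task: str) -> Set[str]:
    """Union of the quality dimensions required by a task's activities."""
    dims: Set[str] = set()
    for required in ACTIVITIES.get(task, {}).values():
        dims |= required
    return dims


def tasks_by_dimension() -> Dict[str, List[str]]:
    """Invert the mapping: for each dimension, list the tasks that need it."""
    return {
        dim: [t for t in TASKS if dim in dimensions_for_task(t)]
        for dim in sorted(DIMENSIONS)
    }


if __name__ == "__main__":
    validate(ACTIVITIES)
    print(tasks_by_dimension())
```

Reading the mapping outward (task to dimensions) answers which quality checks a given modeling step depends on; reading it inward (dimension to tasks) shows which steps are affected when a given quality dimension is weak.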