In:
RöFo - Fortschritte auf dem Gebiet der Röntgenstrahlen und der bildgebenden Verfahren, Georg Thieme Verlag KG, Vol. 189, No. 07 ( 2017-07), p. 661-671
Abstract:
Purpose Projects involving collaborations between different institutions require data security via selective de-identification of words or phrases. A semi-automated de-identification tool was developed and evaluated on different types of medical reports natively and after adapting the algorithm to the text structure. Materials and Methods A semi-automated de-identification tool was developed and evaluated for its sensitivity and specificity in detecting sensitive content in written reports. Data from 4671 pathology reports (4105 + 566 in two different formats), 2804 medical reports, 1008 operation reports, and 6223 radiology reports of 1167 patients suffering from breast cancer were de-identified. The content was itemized into four categories: direct identifiers (name, address), indirect identifiers (date of birth/operation, medical ID, etc.), medical terms, and filler words. The software was tested natively (without training) in order to establish a baseline. The reports were manually edited and the model re-trained for the next test set. After manually editing 25, 50, 100, 250, 500 and if applicable 1000 reports of each type re-training was applied. Results In the native test, 61.3 % of direct and 80.8 % of the indirect identifiers were detected. The performance (P) increased to 91.4 % (P25), 96.7 % (P50), 99.5 % (P100), 99.6 % (P250), 99.7 % (P500) and 100 % (P1000) for direct identifiers and to 93.2 % (P25), 97.9 % (P50), 97.2 % (P100), 98.9 % (P250), 99.0 % (P500) and 99.3 % (P1000) for indirect identifiers. Without training, 5.3 % of medical terms were falsely flagged as critical data. The performance increased, after training, to 4.0 % (P25), 3.6 % (P50), 4.0 % (P100), 3.7 % (P250), 4.3 % (P500), and 3.1 % (P1000). Roughly 0.1 % of filler words were falsely flagged. Conclusion Training of the developed de-identification tool continuously improved its performance. Training with roughly 100 edited reports enables reliable detection and labeling of sensitive data in different types of medical reports. Key Points: Citation Format
Type of Medium:
Online Resource
ISSN:
1438-9029
,
1438-9010
DOI:
10.1055/s-0043-102939
Language:
English
Publisher:
Georg Thieme Verlag KG
Publication Date:
2017
detail.hit.zdb_id:
2031079-1
Bookmarklink