Abstract
Data labelling is crucial for the success of Natural Language Processing (NLP) models, as the quality of labelled data directly affects model accuracy and performance. In early-stage construction design, automating the extraction of textual data is essential for integrating physical and digital workflows. However, data labelling presents significant challenges, requiring careful trade-offs between time, cost, and accuracy to meet project-specific needs. This paper compares three primary data labelling techniques for postcode extraction from project documents: manual, rule-based, and hybrid machine learning approaches. A review of the seminal literature reveals that manual labelling delivers high accuracy and quality but is labour-intensive and better suited to small datasets or to creating gold standards. Rule-based techniques, such as regular expressions (Regex), automate labelling for structured data using predefined patterns, offering efficiency but requiring domain expertise. Machine learning-driven methods, such as Named Entity Recognition (NER), scale to large datasets but often demand task-specific fine-tuning. Because NER alone performed suboptimally in initial testing, a hybrid approach combining Regex with NER was developed and implemented in Google Colab. In an empirical evaluation of postcode extraction from construction project documents, with manual labelling as the gold standard, the rule-based approach achieved 96.7% accuracy and the hybrid machine learning approach achieved 98% accuracy. This paper provides a comparative framework to guide practitioners in selecting the most appropriate data labelling technique for their specific needs, balancing accuracy, efficiency, and scalability to optimise workflows and enhance automation in early-stage building design.
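To make the rule-based technique concrete, the following is a minimal sketch of Regex-based postcode labelling scored against manually labelled gold data. It is not the authors' implementation. The abstract does not name the postcode format, so a simplified UK-style pattern is assumed here, and the names `POSTCODE_RE`, `extract_postcodes`, and `accuracy` are hypothetical. The scoring rule (exact match per document) is likewise an assumption, as the abstract does not define how accuracy was computed.

```python
import re

# Assumption: UK-style postcodes such as "SW1A 2AA"; the abstract does not
# specify the format. Simplified pattern: outward code (1-2 letters, a digit,
# an optional digit or letter), optional space, inward code (digit + 2 letters).
POSTCODE_RE = re.compile(
    r"\b[A-Z]{1,2}[0-9][A-Z0-9]?\s?[0-9][A-Z]{2}\b", re.IGNORECASE
)


def extract_postcodes(text):
    """Return postcodes found in text, upper-cased with a single space."""
    found = []
    for match in POSTCODE_RE.finditer(text):
        compact = re.sub(r"\s+", "", match.group()).upper()
        # The inward code is always the last three characters.
        found.append(f"{compact[:-3]} {compact[-3:]}")
    return found


def accuracy(documents, gold_labels):
    """Share of documents whose extracted postcodes exactly match the gold labels."""
    correct = sum(
        extract_postcodes(doc) == gold
        for doc, gold in zip(documents, gold_labels)
    )
    return correct / len(documents)
```

A hybrid variant would pass the text through an NER model as well and reconcile the two sets of candidate spans; that step is omitted here because the abstract gives no detail on how the two were combined.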
| Original language | English |
|---|---|
| Title of host publication | Proceedings of Digital Frontiers in Buildings and Infrastructure International Conference Series |
| Subtitle of host publication | DFBI 2025 |
| Editors | Farzad Rahimian, Mohammad Fotouhi, M. Reza Hosseini, Amirhosein Ghaffarianhoseini, Sattar S. Emamian |
| Publisher | DFBI |
| Pages | 266-386 |
| Number of pages | 11 |
| ISBN (Print) | 9781068436017 |
| Publication status | Published - 12 Aug 2025 |
| Event | International Conference on Digital Frontiers in Buildings and Infrastructure (DFBI2025), TU Delft, Faculty of Civil Engineering and Geosciences, Technische Universiteit Delft, Delft, Netherlands. Duration: 11 Jun 2025 → 13 Jun 2025. https://www.dfbi.net/ |
Conference
| Conference | International Conference on Digital Frontiers in Buildings and Infrastructure (DFBI2025) |
|---|---|
| Abbreviated title | DFBI 2025 |
| Country/Territory | Netherlands |
| City | Delft |
| Period | 11/06/25 → 13/06/25 |
| Internet address | https://www.dfbi.net/ |