Comparison of Data Labelling Techniques for Automating Postcode Extraction in NLP-Supported Early-Stage Building Design

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Data labelling is crucial for the success of Natural Language Processing (NLP) models, as the quality of labelled data directly affects model accuracy and performance. In early-stage construction design, automating the extraction of textual data is essential for integrating physical and digital workflows. However, data labelling presents significant challenges, requiring careful trade-offs between time, cost, and accuracy to meet project-specific needs. This paper compares three primary data labelling techniques for postcode extraction from project documents: manual, rule-based, and hybrid machine learning approaches. A review of the seminal literature reveals that manual labelling delivers high accuracy and quality but is labour-intensive, making it better suited to small datasets or to creating gold standards. Rule-based techniques, such as regular expressions (Regex), automate labelling for structured data using predefined patterns, offering efficiency but requiring domain expertise. Machine learning-driven methods, such as Named Entity Recognition (NER), enable scalability for large datasets but often demand task-specific fine-tuning. Because NER performed suboptimally in initial testing, a hybrid approach combining Regex with NER was developed and implemented in Google Colab. In an empirical evaluation of postcode extraction from construction project documents, with manual labelling as the gold standard, the rule-based approach achieved 96.7% accuracy and the hybrid machine learning approach achieved 98% accuracy. This paper provides a comparative framework to guide practitioners in selecting the most appropriate data labelling technique for their specific needs, balancing accuracy, efficiency, and scalability to optimise workflows and enhance automation in early-stage building design.
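
As an illustration of the rule-based route and of scoring it against a manually labelled gold standard, the following is a minimal Python sketch. It is not the authors' implementation: the abstract does not give the pattern, the postcode format, or the evaluation code, so the simplified UK-style pattern `POSTCODE_RE`, the helper names `extract_postcodes`, `label_documents` and `accuracy`, and the choice to label each document with its first match are all assumptions made for this example.

```python
import re

# Simplified UK-style postcode pattern (an assumption, not the paper's pattern):
# outward code = 1-2 letters, a digit, then an optional digit or letter;
# inward code  = a digit followed by two letters; the space between is optional.
POSTCODE_RE = re.compile(
    r"\b[A-Z]{1,2}[0-9][A-Z0-9]?\s?[0-9][A-Z]{2}\b", re.IGNORECASE
)


def extract_postcodes(text):
    """Return postcode-like strings in text, upper-cased with a single space."""
    found = []
    for match in POSTCODE_RE.finditer(text):
        compact = re.sub(r"\s+", "", match.group(0)).upper()
        # The inward code is always the last three characters.
        found.append(compact[:-3] + " " + compact[-3:])
    return found


def label_documents(documents):
    """Label each document with its first match, or None if nothing matches."""
    labels = []
    for doc in documents:
        matches = extract_postcodes(doc)
        labels.append(matches[0] if matches else None)
    return labels


def accuracy(predicted, gold):
    """Share of documents whose predicted label equals the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)
```

A call such as `accuracy(label_documents(docs), gold_labels)` then gives the proportion of documents on which the rule-based labels agree with the manual ones. A pattern this loose will also accept strings that are not valid postcodes, which is one reason a second check (for example the NER step of a hybrid approach) may be worthwhile.
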
Original language: English
Title of host publication: Proceedings of Digital Frontiers in Buildings and Infrastructure International Conference Series
Subtitle of host publication: DFBI 2025
Editors: Farzad Rahimian, Mohammad Fotouhi, M. Reza Hosseini, Amirhosein Ghaffarianhoseini, Sattar S. Emamian
Publisher: DFBI
Pages: 266-386
Number of pages: 11
ISBN (Print): 9781068436017
Publication status: Published - 12 Aug 2025
Event: International Conference on Digital Frontiers in Buildings and Infrastructure (DFBI2025) - TU Delft, Faculty of Civil Engineering and Geosciences, Technische Universiteit Delft, Delft, Netherlands
Duration: 11 Jun 2025 - 13 Jun 2025
https://www.dfbi.net/

Conference

Conference: International Conference on Digital Frontiers in Buildings and Infrastructure (DFBI2025)
Abbreviated title: DFBI 2025
Country/Territory: Netherlands
City: Delft
Period: 11/06/25 - 13/06/25
