Web news extraction is a key step in intelligent Web information processing. It underpins research and applications in online public opinion monitoring, integration of heterogeneous Web data sources, and information retrieval, so methods for extracting Web news content have both research and practical value. Building on Web information extraction based on statistics and page structure, this paper improves an existing webpage text extraction algorithm, ERBDF, and designs a Web news text extraction algorithm based on statistics and DOM tree structure (EETD). The two algorithms are then tested and compared on the accuracy and speed of text extraction, and the results show that EETD has better overall performance.
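The abstract does not spell out the steps of EETD or ERBDF, so the following is only a generic sketch of the broader idea it names: combining simple text statistics with DOM tree structure to locate the main text of a page. It is not the authors' algorithm. The names `extract_main_text`, `CANDIDATES`, and the scoring formula (non-link characters divided by the number of tags in a subtree) are assumptions chosen for illustration.

```python
from html.parser import HTMLParser

# Assumed for this sketch: which tags never have children, which subtrees to
# ignore, and which tags may be chosen as the "main text" container.
VOID = {"br", "img", "hr", "meta", "link", "input", "area", "base", "col", "wbr"}
SKIP = {"script", "style"}
CANDIDATES = {"div", "article", "section", "td", "body"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []  # mix of Node and str, kept in document order


class TreeBuilder(HTMLParser):
    """Builds a very small DOM-like tree; tolerant of unmatched end tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.cur = self.root

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        node = Node(tag, self.cur)
        self.cur.children.append(node)
        self.cur = node

    def handle_endtag(self, tag):
        node = self.cur
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.cur = node.parent

    def handle_data(self, data):
        if data.strip():
            self.cur.children.append(data.strip())


def stats(node, in_link=False):
    """Return (text_chars, link_chars, tag_count) for the subtree at node."""
    if node.tag in SKIP:
        return 0, 0, 1
    in_link = in_link or node.tag == "a"
    text = link = 0
    tags = 1
    for child in node.children:
        if isinstance(child, str):
            text += len(child)
            link += len(child) if in_link else 0
        else:
            t, l, n = stats(child, in_link)
            text, link, tags = text + t, link + l, tags + n
    return text, link, tags


def walk(node):
    yield node
    for child in node.children:
        if isinstance(child, Node):
            yield from walk(child)


def plain_text(node):
    """Collect text chunks of a subtree in document order."""
    if node.tag in SKIP:
        return []
    parts = []
    for child in node.children:
        parts.extend([child] if isinstance(child, str) else plain_text(child))
    return parts


def extract_main_text(html):
    """Pick the candidate container with the highest non-link text density."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    best, best_score = None, 0.0
    for node in walk(builder.root):
        if node.tag not in CANDIDATES:
            continue
        text, link, tags = stats(node)
        score = (text - link) / tags  # assumed heuristic, not from the paper
        if score > best_score:
            best, best_score = node, score
    return "\n".join(plain_text(best)) if best else ""
```

Calling `extract_main_text(page_html)` on a typical news page would tend to favour the container holding the article paragraphs over navigation bars, which are mostly link text. A real system would likely need further rules, for example handling implicitly closed tags and comment sections.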
|Title of host publication||Proceedings of 2017 International Conference on Engineering and Technology, ICET 2017|
|Publication status||Published - 8 Mar 2018|
|Event||2017 International Conference on Engineering and Technology - UniversityAntalya, Antalya, Turkey. Duration: 21 Aug 2017 → 23 Aug 2017|
|Conference||2017 International Conference on Engineering and Technology|
|Abbreviated title||ICET 2017|