TY - GEN
T1 - Leveraging Collection-Wide Similarities for Unsupervised Document Structure Extraction
AU - Lior, Gili
AU - Goldberg, Yoav
AU - Stanovsky, Gabriel
N1 - Publisher Copyright: © 2024 Association for Computational Linguistics.
PY - 2024
Y1 - 2024
N2 - Document collections of various domains, e.g., legal, medical, or financial, often share some underlying collection-wide structure, which captures information that can aid both human users and structure-aware models. We propose to identify the typical structure of document within a collection, which requires to capture recurring topics across the collection, while abstracting over arbitrary header paraphrases, and ground each topic to respective document locations. These requirements pose several challenges: headers that mark recurring topics frequently differ in phrasing, certain section headers are unique to individual documents and do not reflect the typical structure, and the order of topics can vary between documents. Subsequently, we develop an unsupervised graph-based method which leverages both inter- and intra-document similarities, to extract the underlying collection-wide structure. Our evaluations on three diverse domains in both English and Hebrew indicate that our method extracts meaningful collection-wide structure, and we hope that future work will leverage our method for multi-document applications and structure-aware models.
AB - Document collections of various domains, e.g., legal, medical, or financial, often share some underlying collection-wide structure, which captures information that can aid both human users and structure-aware models. We propose to identify the typical structure of document within a collection, which requires to capture recurring topics across the collection, while abstracting over arbitrary header paraphrases, and ground each topic to respective document locations. These requirements pose several challenges: headers that mark recurring topics frequently differ in phrasing, certain section headers are unique to individual documents and do not reflect the typical structure, and the order of topics can vary between documents. Subsequently, we develop an unsupervised graph-based method which leverages both inter- and intra-document similarities, to extract the underlying collection-wide structure. Our evaluations on three diverse domains in both English and Hebrew indicate that our method extracts meaningful collection-wide structure, and we hope that future work will leverage our method for multi-document applications and structure-aware models.
UR - http://www.scopus.com/inward/record.url?scp=85205282621&partnerID=8YFLogxK
M3 - منشور من مؤتمر
T3 - Proceedings of the Annual Meeting of the Association for Computational Linguistics
SP - 9538
EP - 9550
BT - 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024 - Proceedings of the Conference
A2 - Ku, Lun-Wei
A2 - Martins, Andre
A2 - Srikumar, Vivek
PB - Association for Computational Linguistics (ACL)
T2 - Findings of the 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024
Y2 - 11 August 2024 through 16 August 2024
ER -