Summarizing historical text can help to better gather, organize, and share knowledge by identifying the key points in original documents. However, this comes at the cost of time and effort. Due to cultural and linguistic changes over time and the sheer volume of archives, interpreting historical text can be challenging even for experts.
Researchers at the University of Sheffield, Beihang University, and the Open University in the U.K. recently attempted to tackle this using AI and machine learning techniques. They say that their approach, which can summarize historical documents written in German and Chinese, provides a strong baseline for future studies.
The researchers chose to focus on German and Chinese for their “rich textual heritages” and “accessible” resources for historical and modern forms. Both serve as “outstanding” representatives of two distinct writing systems (German for alphabetic and Chinese for ideographic), and investigating them could lead to generalizable insights for a wide range of other languages, according to the researchers. Moreover, linguistic experts in both languages are abundant, making modern-language summaries of German and Chinese text easy to obtain and use for evaluating machine learning summarization systems.
To build a historical German language training dataset, the researchers picked newspapers from the years 1650 to 1800, randomly selecting 100 out of 383 stories for annotation. For Chinese, they chose a collection of stories from the Wanli period of the Ming Dynasty, searching over 200 related academic papers and retrieving 100 news texts. To generate summaries in the modern languages for the historical stories, the coauthors recruited two experts with degrees in Germanistik and Ancient Chinese Literature, respectively. The two experts produced a corpus of 100 news stories and summaries in each language, which six other experts then examined for quality control.
The researchers note that they only had summarization training data for modern German and Chinese and very limited corpora for historical forms of the languages. To get around these limitations, they used a transfer learning-based approach that they say could be bootstrapped even without cross-lingual training — i.e., training across historical and modern forms of the languages.
“Historical text summarization posits some unique challenges … Historical texts cannot be handled by traditional cross-lingual summarizers, which require cross-lingual [training] or at least large summarization datasets in both languages,” the researchers wrote. “Further, language use evolves over time, including vocabulary and word spellings and meanings, and historical collections can span hundreds of years. Writing styles also change over time. For instance, while it is common for today’s news stories to present important information in the first few sentences, a pattern exploited by modern news summarizers, this was not the norm in older times.”
In experiments, the researchers say that automatic and human evaluations demonstrated the strength of their method over state-of-the-art baselines. In the future, they plan to improve their models to add further languages and increase the size of the training dataset they used for each language.
“This paper introduced the new task of summarizing historical documents in modern languages, a previously unexplored but important application of cross-lingual summarization that can support historians and digital humanities researchers,” the researchers wrote. “This paper is the first study of automated historical text summarization.”