GenAI: streamlining document analysis
At the Centre for Media Transition, we have done a lot of research on the impacts of generative AI (genAI) on journalism and the information ecosystem (examples here and here). However, newsrooms are not the only ones testing the capabilities of this new tech. The Centre has recently finished a trial of its own, using genAI for our research.
Newsrooms have told us that extracting information from large text corpora is one potential use of AI they are exploring. Its uses range from searching through historical, legal or business documents to finding newsworthy material in press releases. The appeal is obvious: have AI carry out the tedious documentary work, freeing journalists to review the output and focus on higher-value tasks such as analysis.
As academics, we also regularly face situations where we must analyse large volumes of documents. One recent task was an analysis of the different viewpoints presented in submissions to last year’s government consultation, which proposed giving the Australian Communications and Media Authority (ACMA) new powers to combat misinformation and disinformation. In total, there were 2,418 public submissions. Traditionally, this would be a manual, time-consuming project in which each submission would be reviewed by hand, thoroughly testing human concentration spans and patience. AI, thankfully, has an unlimited supply of both (or whatever is considered the silicon-based equivalent)!
In a nutshell, our method used OpenAI’s GPT-4o model and an open-source PDF conversion tool called Marker. Once the PDF submissions had been converted to markdown (a type of text formatting), the GPT-4o model was prompted to evaluate each submission against a set of very specific and direct questions we had formulated. These questions asked the model to extract information on issues such as whether the submission commented on the definition of ‘misinformation’ or whether it supported excluding professional news content. Crucially, for each response, the model was asked to provide a direct quote from the submission alongside its reasoning for the decision, allowing for quicker validation when reviewing the results.
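For readers curious about what this looks like in practice, here is a minimal sketch of the prompting step in Python. It assumes the submissions have already been converted to markdown with Marker and uses the OpenAI Python SDK with JSON-mode output; the question wording, prompt template, file paths and function names are illustrative assumptions, not the exact ones used in the project.

```python
# Minimal sketch: asking GPT-4o structured questions about one submission.
# Assumes the PDFs were already converted to markdown with Marker,
# and that OPENAI_API_KEY is set in the environment.
import json
from pathlib import Path

from openai import OpenAI

client = OpenAI()

# Illustrative questions only; the study used its own formulations.
QUESTIONS = [
    "Does the submission comment on the definition of 'misinformation'?",
    "Does the submission support excluding professional news content?",
]

PROMPT_TEMPLATE = (
    "You are analysing a public submission to a government consultation.\n"
    "Answer the question with 'yes', 'no' or 'unclear'. Provide a direct quote "
    "from the submission supporting your answer, and brief reasoning.\n"
    "Respond as JSON with keys: answer, quote, reasoning.\n\n"
    "Question: {question}\n\nSubmission:\n{submission}"
)


def analyse_submission(markdown_path: Path) -> list[dict]:
    """Ask each question about one submission and collect structured answers."""
    text = markdown_path.read_text(encoding="utf-8")
    results = []
    for question in QUESTIONS:
        response = client.chat.completions.create(
            model="gpt-4o",
            temperature=0,  # reduce run-to-run variation
            response_format={"type": "json_object"},
            messages=[{
                "role": "user",
                "content": PROMPT_TEMPLATE.format(question=question, submission=text),
            }],
        )
        record = json.loads(response.choices[0].message.content)
        record["question"] = question
        results.append(record)
    return results


if __name__ == "__main__":
    # Hypothetical folder of Marker output, one markdown file per submission.
    for path in sorted(Path("submissions_markdown").glob("*.md")):
        print(path.name, analyse_submission(path))
```

Setting the temperature to zero and requesting JSON output makes the responses easier to compare across runs and to tabulate, though it does not remove variability entirely.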
This project has, of course, not been without its challenges. Ethics, along with issues such as bias and the reproducibility of genAI content, had to be carefully considered. Testing for reliability and accuracy became the bulk of the time spent on the task. That being said, even with the additional verification steps and a significant period of trial-and-error testing, we were able to analyse these submissions in a fraction of the time and resources a manual human review would have taken.
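As one illustration of what such verification can involve (an assumed check, not necessarily the exact one used in the project), a quote returned by the model can be tested against the source document to confirm it actually appears there:

```python
import re


def quote_in_source(quote: str, source_text: str) -> bool:
    """Check that a model-supplied quote appears verbatim in the submission,
    ignoring differences in whitespace and letter case."""
    normalise = lambda s: re.sub(r"\s+", " ", s).strip().lower()
    return normalise(quote) in normalise(source_text)
```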
The results are shaping up to be quite insightful, and we look forward to sharing them in a future newsletter. In the meantime, the success here has buoyed my optimism that genAI, whether in academia or a newsroom, can speed up tedious work, sparing time for more enjoyable or harder-hitting tasks.
Kieran Lindsay, CMT Research Officer