Using AI to analyse politicians' speeches in real time
As software engineers, we don't particularly specialise in foreign politics. However, when Sander and I (Thomas) got the chance to unleash the power of AI on upcoming Swedish political speeches, we were happy to jump at the opportunity. It does, however, require a bit of context to understand where our interest in Swedish politics came from.
Q42 is part of Eidra, a group of creative and consultancy companies headquartered in Sweden. One political tradition in Sweden is Almedalsveckan, a week in which all of political Sweden swarms to the Almedalen park on the island of Gotland. Imagine a nice summer picnic in the park, but with suits instead of shorts, and about 30,000 people instead of 3. At the end of every day, it's a different party leader's turn to convince the crowds, in an afternoon speech, that their party is what's best for Sweden. Eidra would be present at the 2024 Almedalsveckan, which leads us to the involvement of Q42.
Building a prototype
At Almedalen, Eidra would be welcoming people into a garden to watch the broadcast of that day's party leader speech whilst enjoying a drink in the evening sun. For this garden, they wanted something that would encourage guests to think about how AI could be useful to them, as well as provide a novel conversation starter.
The idea was to add a second screen to the speech, showing real-time analytics of what was being said. If a politician were to make a promise in their speech, for example, it would then show up on the screen. The system would be called "Klartext" (Swedish for "plain language"). So, in mid-April, two and a half months before this year's event, the Eidra Almedalen team reached out to us at Q42. They wanted to know if we could build a Klartext prototype. Building this prototype would validate whether the real-time analysis was possible, as well as serve as a springboard for other developers within Eidra to build a production-ready version of the AI app (foreshadowing).
So, eager to see how much sense AI can make of political lingo, my colleague Sander and I got to work on the prototype.
What do we want to measure
The first part of this project was to figure out what it was that we wanted to measure in each speech. Together with our Swedish colleagues, we started writing out a bunch of Key Performance Indicators (KPIs) that we thought would be interesting. These ranged from keeping track of which promises the politician made, to scoring how rational or emotional their speech was. Besides just measuring, we thought it would be fun to also leverage some generative AI, by having the system create summary images of the speeches.
How do we measure
After we had finished our initial list of KPIs, Sander got to work on building a front-end that could show the KPIs in a visually appealing manner, and I got started on figuring out how we could process the speeches into KPIs. The processing could be broken down into three distinct questions:
- How do we get the speech into the system?
- How do we turn the speech into KPIs?
- How do we set up the KPIs so that we can easily add and edit them?
The last one was especially important, since AI-based systems often require a lot of tweaking, but more on that later.
To get some clarity on how this all could work, I got to work on one of my personal favourite activities: making little system architecture sketches (we don't call Q42 a "happy place for nerds" for nothing).
When building something from scratch, these diagrams are a great way both to organise your thoughts and to convey them to others for discussion and feedback.
This diagram shows what the system could look like. Everything in dashed lines was out of scope for the very first iteration of this system. On the left hand side, we have everything that needs to run on-site, and on the right we have everything that can run remotely.
Listening to the speech
Getting the speech into the system is where we encountered our first challenge. As it turns out, at the time of development there weren't really any APIs or services available that suited our needs (gpt-4o hadn't even been announced yet). There were services that could do live speech transcription, but they often came with constraints: some were limited to English, while others required a hard-coded language setting, which would be problematic if a politician were to, say, use an English quote in their Swedish speech. What did help was that our transcript didn't need to be extremely low-latency, as we weren't doing anything like real-time subtitling.
Using BlackHole to do some virtual audio routing, I was able to route the audio from the speech livestream into a program of my choosing. Then I wrote a little Python application that would listen to the virtual BlackHole audio input. Whenever a pause in speech was detected, the application would save all the previous audio to a chunk and start a new chunk. There are pip packages that are really helpful here. Once a chunk was complete, we'd send it off to OpenAI's Whisper API to have it transcribed. This meant that we were continuously uploading small audio segments of the speech and getting transcripts back for them.
One interesting finding here was that you need to give Whisper enough context for it to determine what language it's listening to. If we fed it very short segments of Dutch speech, it would occasionally mistake it for German. By setting a minimum chunk length, and merging chunks to reach that threshold, we could ensure that the audio segments we sent to Whisper were long enough to be properly transcribed.
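To give an idea of how little code this takes, here is a minimal sketch of such a chunking loop, using the sounddevice package for audio capture and the OpenAI Python SDK for the Whisper call. The device name, the silence detection and the thresholds are illustrative rather than the exact values from our prototype:

```python
import io
import wave

import numpy as np
import sounddevice as sd
from openai import OpenAI

client = OpenAI()            # expects OPENAI_API_KEY in the environment

SAMPLE_RATE = 16_000         # Hz
BLOCK_SIZE = 4096            # frames per read
SILENCE_RMS = 0.01           # below this, a block counts as silence
PAUSE_BLOCKS = 8             # this many silent blocks in a row marks a pause
MIN_CHUNK_SECONDS = 10       # keep merging until a chunk is at least this long


def to_wav(samples: np.ndarray) -> io.BytesIO:
    """Pack float32 mono samples into an in-memory 16-bit WAV file."""
    buf = io.BytesIO()
    buf.name = "chunk.wav"   # the API uses the file name to detect the format
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes((samples * 32767).astype(np.int16).tobytes())
    buf.seek(0)
    return buf


def transcribe(samples: np.ndarray) -> str:
    """Send one audio chunk to the Whisper API and return its transcript."""
    result = client.audio.transcriptions.create(model="whisper-1", file=to_wav(samples))
    return result.text


buffered: list[np.ndarray] = []
silent_blocks = 0

# "BlackHole 2ch" is the virtual device that receives the livestream audio.
with sd.InputStream(device="BlackHole 2ch", channels=1, samplerate=SAMPLE_RATE) as stream:
    while True:
        block, _ = stream.read(BLOCK_SIZE)
        buffered.append(block[:, 0])
        is_silent = np.sqrt(np.mean(block ** 2)) < SILENCE_RMS
        silent_blocks = silent_blocks + 1 if is_silent else 0

        chunk_seconds = sum(len(b) for b in buffered) / SAMPLE_RATE
        # Only flush on a pause, and only once the chunk is long enough for
        # Whisper to reliably detect which language it is hearing.
        if silent_blocks >= PAUSE_BLOCKS and chunk_seconds >= MIN_CHUNK_SECONDS:
            print(transcribe(np.concatenate(buffered)))
            buffered, silent_blocks = [], 0
```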
Processing the speech
With the Python application running, we were getting a steady stream of transcript segment events. This meant we could now focus on how we would actually process the transcript into KPIs as well as how we could keep the system flexible.
That last element is especially important when working with Large Language Models (LLMs). Unlike 'regular' programming, which is very deterministic, working with LLMs can be quite fuzzy. From the start, it was clear that a significant effort in prompt engineering would have to be made in order to get the LLM to do exactly what we wanted it to do. To avoid unexpected behaviour, it's also generally a good idea to give an LLM one task at a time to work on.
To facilitate prompt engineering and keep the various KPIs out of each other's way, I decided to build the analysis system as a sort of framework, where all the KPIs are defined in a configuration file. Upon launching, the system spins up a separate 'chat' with ChatGPT for each KPI that is defined in the configuration. At the start of each chat, we inject the instructions for that particular KPI as a system prompt, and after that, we continuously follow up with new bits of transcript. This allows the LLM to have access to the entire speech at all times as well as 'remember' its own earlier replies.
This configuration setup allows us to quickly add or tweak KPIs without having to change any of the business logic. If we define five KPIs, we will have five concurrent chats going on, each one reporting on its own KPI. If we want a sixth KPI, we can just add it to the config, and the system will pick it up automatically. Here's an example of one of those KPI configs:
RationalScore: {
  type: "numeric",
  prompt: `
    The metric you are tracking is the rational score. This is a score that represents how rational or emotional the reasoning in the speech is.
    You can score the speech on a scale of -10 to 10, where -10 is completely emotional and 10 is completely rational.
    As the score gets closer to the edges of the scale, increasingly more is required to move it even closer to the edges (exponential scale, so the difference between -9 and -10 is about the same as the difference between 0 and -9).
    Every rational argument moves the score towards 10, while every emotional argument moves the score towards -10.
    Your response should look like:
    {
      "value": 0
    }
    Where the value represents the score you would give the speech. Always take the entire speech into consideration, not just the latest bits of transcript.
  `
}
The prompt for each KPI worker is also preceded by a general prompt which instructs the LLM to output in JSON format and respond with “{}” when there is nothing new to report on.
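To make the framework a bit more concrete, here is a stripped-down sketch of what such a per-KPI worker can look like, using the OpenAI Python SDK. The model name, the config file format and the JSON handling are illustrative, and error handling (for example for replies that aren't valid JSON) is left out:

```python
import json

from openai import OpenAI

client = OpenAI()

GENERAL_PROMPT = (
    "You are analysing a live political speech. Always respond with valid JSON. "
    "If there is nothing new to report for your metric, respond with {}."
)


class KpiWorker:
    """One ongoing chat per KPI, fed with transcript segments as they arrive."""

    def __init__(self, name: str, kpi_prompt: str, model: str = "gpt-4-turbo"):
        self.name = name
        self.model = model
        # The general instructions plus the KPI-specific prompt become the system prompt.
        self.messages = [{"role": "system", "content": f"{GENERAL_PROMPT}\n\n{kpi_prompt}"}]

    def on_transcript(self, segment: str) -> dict:
        # Every new bit of transcript is appended to the same chat, so the model
        # always sees the whole speech so far as well as its own earlier replies.
        self.messages.append({"role": "user", "content": segment})
        reply = client.chat.completions.create(model=self.model, messages=self.messages)
        content = reply.choices[0].message.content
        self.messages.append({"role": "assistant", "content": content})
        return json.loads(content)


# One worker per KPI defined in the configuration; "kpis.json" stands in for
# whatever format the configuration file actually uses.
with open("kpis.json") as f:
    kpi_config = json.load(f)
workers = {name: KpiWorker(name, cfg["prompt"]) for name, cfg in kpi_config.items()}


def handle_segment(segment: str) -> None:
    """Fan a new transcript segment out to every KPI chat."""
    for name, worker in workers.items():
        result = worker.on_transcript(segment)
        if result:  # an empty object means "nothing new" for this KPI
            print(name, result)
```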
Demonstrating the prototype
After a few days of coding, we had a first complete version of both the front-end and the back-end. The system was still very much a prototype and definitely had some rough edges. But within about a week, we had built something that dynamically reported on KPIs based on real-time speech. It was a lot of fun to see how you can build a prototype and validate an idea with modern AI services and APIs when you have limited time and resources available.
Since our quick prototype had shown that Klartext could become a reality, it was full steam ahead on building the real thing. As we had thoroughly enjoyed working on the project, both Sander and I were eager to take this prototype to production. There was just one issue: neither Sander nor I speak Swedish, which would be a problem when writing and evaluating Swedish LLM prompts. Furthermore, it would be beneficial to have someone on-site in Sweden who knew all the ins and outs of the system. So it was time to bring in our secret weapon: our colleague Alexander from Q42's Swedish sister company Above. We handed off our very early-stage prototype to him to make it ready for prime time.
Real-Time Transcription and AI Analysis in Production
From the moment Thomas and Sander shared the Klartext prototype with me, I (Alexander) was thrilled to join the project. As a member of the Design Technology Team at Above – a Swedish product and innovation agency – I was no stranger to working with prototypes and rapidly bringing them into early production. However, this time the project felt especially exhilarating. The challenge wasn't just about quickly moving a prototype into production. We also had to adapt to the novel approach of working with generative AI's probabilistic outputs, all while maintaining a sharp focus on current Swedish political discourse. How could we scale, test and dry-run such a system before going live, especially with no knowledge of this year's speech topics and no access to any transcripts beforehand?
Adding to the complexity was the tight timeline and full exposure from day one at Almedalsveckan. At the time when I took over, we had already announced a live event where Klartext's real-time results would be displayed on a big screen at Eidra's venue in the middle of Visby, in front of a large audience. TV and radio interviews that would discuss the app and the outcomes were scheduled, and some party leaders would be confronted with the analysis immediately after their speeches. None of us wanted this to fail – whether due to lack of results or, worse, nonsensical or wrong ones – and even a half-hour delay was unacceptable. We were betting big that this would somehow work.
Transitioning the prototype into a production-ready system for Almedalsveckan therefore came with high expectations, but also with exciting challenges.
At the content level, we aimed to develop and test a set of meaningful KPIs for the speeches, including sentiment analysis of key words, detection of critical elements like election pledges or greater themes of the political agenda, and a general score-based evaluation of the rhetorical characteristics of the speech.
At the technical level, our objectives were to maintain low latency and ensure scalability for at least eight scheduled speeches, while allowing for an open number of new KPIs. And, of course, we aimed to keep the entire pipeline running smoothly under a potentially heavy user load.
Towards an event-driven cloud architecture
While we didn't expect KPI results to appear on the screens within milliseconds, we aimed to display results in under a minute. Therefore, we ported the initial prototype — where a single machine handled everything from transcription to analysis, writing results into non-permanent storage — into a Firebase architecture. Firebase offered a straightforward, all-in-one solution, providing several important advantages worth mentioning.
First off, we could now implement the KPI analysis as separate Cloud Functions, unlocking several core benefits for the overall architecture.
Scalability:
By implementing KPI analysis as individual Cloud Functions, each KPI now scales independently based on demand. For instance, when a high volume of transcripts needs to be analysed, the system can handle multiple KPIs per speech without bottlenecks. This separation allows us to accommodate a growing number of KPIs without compromising performance.
Isolation:
Each analysis runs in its own isolated function, ensuring consistent and predictable system performance, even when handling complex or resource-intensive calculations. For instance, a slower analysis for one KPI won't impact the performance of others.
Event-Driven Architecture:
Leveraging Firebase's event triggers has been pivotal in streamlining the AI-driven KPI analysis process. With an event-driven setup, the KPI analysis only starts when needed, specifically when a new transcript arrives from the transcription service. This event-based triggering eliminates unnecessary processing and long-running scheduling routines, and ensures resources are only consumed when required (a minimal sketch of such a trigger follows below).
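As an illustration of that trigger pattern, a single KPI function could look roughly like the sketch below, written with the Python flavour of Cloud Functions for Firebase. The collection layout is made up, and analyse_rational_score is a placeholder standing in for the LLM call described earlier:

```python
from firebase_admin import firestore, initialize_app
from firebase_functions import firestore_fn

initialize_app()


def analyse_rational_score(text: str) -> int:
    """Placeholder for the LLM-based analysis described earlier."""
    return 0


# Fires once for every new transcript chunk written by the transcription service.
@firestore_fn.on_document_created(document="speeches/{speechId}/transcripts/{chunkId}")
def analyse_transcript_chunk(
    event: firestore_fn.Event[firestore_fn.DocumentSnapshot | None],
) -> None:
    if event.data is None:
        return
    chunk = event.data.to_dict()

    # In production each KPI runs as its own function, so a slow analysis for
    # one KPI never blocks the others; here we sketch just a single KPI.
    score = analyse_rational_score(chunk["text"])

    db = firestore.client()
    db.document(f"speeches/{event.params['speechId']}/kpis/RationalScore").set(
        {"value": score}, merge=True
    )
```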
In the second step, the front-end was reengineered to fully leverage Firestore's real-time data features. Instead of polling for updates, we set up subscriptions to real-time changes in the Firestore database, allowing us to keep the views updated without delay as soon as new results became available. Due to limited time and resources, we focused on a mobile-first UX, anticipating that most users would follow the analysis from their mobile devices while attending the speeches or participating in other activities at Almedalsveckan. With great help from Mart-Jan and Anton from Q42, we implemented a carousel-based design, enabling seamless navigation between the different KPI views for each speech.
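The web front-end uses onSnapshot from the Firestore JavaScript SDK for this. The listener pattern itself is the same in every client library; for consistency with the other sketches, here is the idea with the Python client and an illustrative collection path:

```python
from google.cloud import firestore

db = firestore.Client()


def on_kpi_update(snapshots, changes, read_time):
    """Called immediately with the current state and again on every change."""
    for doc in snapshots:
        print(doc.id, doc.to_dict())  # in the real app: re-render the KPI view


# Subscribe instead of polling: Firestore pushes every KPI update for this speech.
watch = db.collection("speeches/almedalen-2024-day-1/kpis").on_snapshot(on_kpi_update)
```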
Introducing Speech Configurations and Event States
To manage the lifecycle of Klartext events, we introduced some new data model structures, such as speech configurations (i.e. information about the party and the speaker, and the scheduled speech time) and transcription states (i.e. scheduled, ongoing, finished). These were particularly useful for a festival-like live event such as Almedalsveckan. With the help of some prepared admin API endpoints, we could quickly update settings and schedules across the full stack and roll out last-minute changes to the front-end as needed.
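As a rough idea of these structures, a speech configuration can be pictured like the sketch below; the field names are illustrative rather than the exact production schema:

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class TranscriptionState(str, Enum):
    SCHEDULED = "scheduled"
    ONGOING = "ongoing"
    FINISHED = "finished"


@dataclass
class SpeechConfig:
    """Roughly the shape of a speech document (illustrative fields only)."""
    party: str
    speaker: str
    scheduled_at: datetime
    state: TranscriptionState = TranscriptionState.SCHEDULED
    kpis: tuple[str, ...] = ("RationalScore", "ElectionPledges")
```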
Further optimisations: finding the right tools for the right tasks at the right time
In addition to using an LLM like ChatGPT for qualitative KPIs, we integrated 'sv_core_news_sm' – a smaller Natural Language Processing model trained on Swedish – to handle lightweight quantitative tasks. These tasks included retrieving keyword frequency and calculating unique words per 100 words. This reduced the number of expensive and time-consuming LLM calls, balancing both performance and cost.
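Since sv_core_news_sm is one of spaCy's small Swedish pipelines, these quantitative KPIs come down to only a few lines of Python. The keyword list and the exact counting rules below are examples:

```python
from collections import Counter

import spacy

# Installed once via: python -m spacy download sv_core_news_sm
nlp = spacy.load("sv_core_news_sm")


def quantitative_kpis(transcript: str, keywords: set[str]) -> dict:
    """Cheap, deterministic KPIs that don't need an LLM call."""
    doc = nlp(transcript)
    words = [token.lemma_.lower() for token in doc if token.is_alpha]
    counts = Counter(words)
    return {
        # How often each tracked keyword (lemmatised) has occurred so far.
        "keyword_frequency": {kw: counts.get(kw, 0) for kw in keywords},
        # Lexical variety, expressed as unique words per 100 words.
        "unique_words_per_100": round(len(set(words)) / max(len(words), 1) * 100, 1),
    }


print(quantitative_kpis("Vi lovar fler jobb och en bättre skola för alla.", {"jobb", "skola"}))
```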
We also optimised how KPIs reacted to growing speech transcripts, for example:
- Some KPIs, like sentiment analysis, processed smaller and more recent transcript chunks to provide faster updates.
- Other KPIs, like theme analysis, accumulated the transcribed context over longer periods of time for more comprehensive insights, following a logic similar to how you need to hear a joke in its entirety to understand the punchline. Both strategies are sketched below.
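In code, the difference between the two strategies is mainly which part of the chunk list a KPI gets to see; the window size here is illustrative:

```python
def recent_context(chunks: list[str], window: int = 3) -> str:
    """Fast-updating KPIs (e.g. sentiment) only look at the latest few chunks."""
    return " ".join(chunks[-window:])


def accumulated_context(chunks: list[str]) -> str:
    """Slow, comprehensive KPIs (e.g. theme analysis) see the whole speech so far."""
    return " ".join(chunks)
```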
Finally, for fear of power blackouts, internet outages or any of the other unpleasant surprises that await tech people who need to go live, we deployed a backup system to provide some redundancy during the live events: it transcribed speeches on a separate machine over a different internet connection, with a mechanism to seamlessly switch between systems in case of failure.
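The switch-over itself can be kept simple. One way to do it (shown here purely as an illustration, not necessarily how our backup worked) is a heartbeat: the primary machine keeps refreshing a timestamp in a shared document, and the backup promotes itself when that timestamp goes stale. The document path and timings below are made up:

```python
import time

from google.cloud import firestore

db = firestore.Client()
HEARTBEAT = db.document("system/primary-heartbeat")  # illustrative path
STALE_AFTER_SECONDS = 30


def primary_heartbeat_loop() -> None:
    """The primary transcription machine keeps refreshing a heartbeat document."""
    while True:
        HEARTBEAT.set({"at": firestore.SERVER_TIMESTAMP})
        time.sleep(10)


def backup_should_take_over() -> bool:
    """The backup machine polls the heartbeat and takes over when it goes stale."""
    snapshot = HEARTBEAT.get()
    if not snapshot.exists:
        return True
    return time.time() - snapshot.get("at").timestamp() > STALE_AFTER_SECONDS
```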
Lessons Learned and Looking Ahead
Almedalsveckan 2024 marked the first time we launched our new Klartext system, and we gained valuable insights and learnings throughout the process. Key takeaways include:
- Reliable transcript quality is of paramount importance:
Just as a chef can only create a dish as good as the ingredients they start with, our AI analysis pipeline can only produce analyses and insights as valuable as the raw transcript data we feed into it. It was important to take extra care that the Whisper model did not add hallucinations (such as inserting a “Thank you! Thank you!” into the transcript whenever the audience applauded) or misinterpret the language of shorter phrases.
- Test comprehensively:
Due to its probabilistic nature, the real-time AI analysis required extensive testing and prompt engineering. Testing the pipeline well in advance, using a wide range of past speech recordings whilst carefully calibrating the parameters of OpenAI's API, was essential for ensuring reliability during live events. Complex rhetorical figures that spread over multiple transcript chunks, indirect speech, and references to political developments after the LLM's cutoff date turned out to be especially challenging.
- Spend time to balance performance, cost and context:
Managing API call frequency, especially for costly tasks like LLM queries, was crucial for keeping runtime costs low without sacrificing accuracy.
By integrating the KPI analysis into a cloud-based, serverless, event-driven architecture, we eventually transformed the prototype into a more robust, scalable Klartext system that was ready for real-time AI analysis during ongoing speeches.
Over the four consecutive days at Almedalsveckan, the launch of Klartext was a success (all results are accessible via klartext.eidra.com): hundreds of guests visited the Eidra venue for the analysis, which not only enriched the event but also showcased the potential of AI to make political discourse more accessible and engaging. Right on the first day, when confronted with the results of the analysis, a party leader confirmed the overall accuracy of the results. Our app sparked meaningful conversations about technology's role in society, and the fruitful collaboration between the sister companies Q42 and Above opened up new avenues for innovation within the wider multinational Eidra group.
A few months after Almedalen, we developed the system further by adding a multi-speaker mode, and had Klartext analyse the presidential debate between Kamala Harris and Donald Trump. Looking ahead, we’re excited to build upon these foundations, exploring new features and applications that can further contribute to bridging the gap between complex information and greater understanding and comparability in real time.
Do you also love making impact with technical innovations like AI? Then do check out our job vacancies at werkenbij.q42.nl!