interTwin Use Case

A Digital Twin to simulate 'noise' in Radio Astronomy

Building a digital twin of a source-telescope system that provides synthetic data, which can be used to train the expert system.

Challenge

For the first time since radio astronomy emerged in the middle of the 20th century, we can no longer deal with the data in the traditional way. With the old radio telescopes the volume of data collected was relatively small, so the principle “the original data is sacred” was typically invoked: all the raw data were kept indefinitely, or until the scientists in charge of the project decided it was safe to delete them, normally years after the completion of observations. But with the new generation of telescopes like the South African MeerKAT or the Australian ASKAP, precursors to the Square Kilometre Array (SKA) telescope, we no longer have that luxury: they collect so much raw data that it is simply physically impossible to keep it for more than a few weeks; even the best data storage facilities would quickly reach their limits. With such volumes it is of course also impossible for experts to manually sort through the data and decide what should be kept and what can be safely deleted: this task must fall to machine-learning-based automated decision-making systems.

This unfortunately raises another problem: because of the sheer volume of data involved, such systems must be capable of high-performance computing (HPC), that is, of running on modern supercomputers. Otherwise the system would not be able to keep up with the telescope in real time, and the data flood would still occur. But most modern radio-astronomical data processing tools were written with small computers in mind and so handle parallelization, the central computing principle of HPC, poorly.

On the other hand, once an automated decision-making system is ready, it can do more than just resolve the data overflow problem. We can finally approach the most mysterious subject of radio astronomy: radio transients. Radio transients are sources that drastically change their brightness over time. In the worst cases they are just short “blips” of radio emission in random directions: one cannot predict where or when they will happen, and so they can be observed only by an extremely lucky coincidence. Tantalizingly, it is believed that these phenomena may be produced by exotic events at extreme distances, for example by the collision of two black holes, and so they can provide us with clues about some very interesting and powerful astrophysical processes that cannot be gleaned in any other way. An automated expert system running on a modern radio telescope surveying a large area of the sky can be taught to discern a transient and point the telescope at it. It can also trigger the “target of opportunity” mode at other observatories, prompting powerful infrared, optical or X-ray telescopes to point at the transient and possibly learn much more about the event: what is brief and weak at radio frequencies may be long and bright at other wavelengths.

Solution

We are developing a framework we call ML-PPA (Machine Learning-based Pipeline for Pulsar Analysis), a pilot automated expert system that can sort through radio-astronomical data streams. As targets we chose the simplest radio-astronomical sources that can still be classified as transients: pulsars. A pulsar is a radio source that emits bursts of radiation at regular intervals, from milliseconds to minutes. Because of their predictability pulsars are well studied and easy to observe even with older telescopes, so they are a great source of test data. A telescope pointed at a pulsar mostly sees “nothing”, that is, it just records the regular noise pattern of the sky and the telescope electronics, like a TV set with no input signal. But at times (less than 1% of the total observation time) it sees the pulse of the pulsar, the scientifically valuable data we are after. That is not all: since we do not live in a perfect world, all kinds of natural or human-made radio-frequency interference (RFI) can occur, and so the third general type of data is RFI. Other types are also possible, such as mixed frames or something unknown. Our system reads the data stream as a sequence of short “time frames” and can be trained to assign each frame a label according to the type of data it contains: “none”, “pulse”, “RFI of XXX type” etc.
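To make the frame-labelling idea concrete, here is a minimal Python sketch of cutting a dynamic spectrum into short time frames and assigning each frame a label. The toy thresholding classifier and all numbers are purely illustrative assumptions; in ML-PPA itself the labels are assigned by the trained expert system, not by fixed thresholds.

```python
import numpy as np

def split_into_frames(stream: np.ndarray, frame_len: int) -> np.ndarray:
    """Cut a dynamic spectrum (time x frequency) into short time frames."""
    n_frames = stream.shape[0] // frame_len
    return stream[: n_frames * frame_len].reshape(n_frames, frame_len, -1)

def classify_frame(frame: np.ndarray) -> str:
    """Toy stand-in for the trained classifier: broadband excess power
    counts as "pulse", one hot frequency channel as "RFI", else "none"."""
    per_channel = frame.mean(axis=0)   # mean power in each frequency channel
    if per_channel.mean() > 0.5:       # power spread across the whole band
        return "pulse"
    if per_channel.max() > 3.0:        # power concentrated in one channel
        return "RFI"
    return "none"

# Example: label a stream of pure receiver noise
rng = np.random.default_rng(0)
stream = rng.normal(0.0, 1.0, size=(1024, 64))  # 1024 time samples, 64 channels
frames = split_into_frames(stream, frame_len=128)
print([classify_frame(f) for f in frames])      # all "none" for pure noise
```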

However, for the system to learn this, it must first be fed already-labelled data of each type, and the most important “pulse” type is much scarcer than the others. Thus we also have to build a digital twin of the source-telescope system that can provide us with synthetic data, closely matching the real data, which can be used to train the expert system. The digital twin recreates the physics of the pulsar, the path of its signal through the interstellar medium and the detection process of the telescope equipment. This has the advantage that any parameter of any component can be tweaked to produce exactly the kind of synthetic data that is needed. The tandem of the digital twin and the machine-learning expert system is what makes this framework both robust and versatile.
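As a sketch of how these three stages might be chained, the Python snippet below builds a toy dynamic spectrum: a periodic pulsar signal, dispersed by the interstellar medium, then “observed” by a telescope that adds receiver noise and RFI. Every function, parameter and number here is an illustrative assumption, a simplified sketch rather than the actual ML-PPA digital twin.

```python
import numpy as np

def pulsar_signal(t, period=0.5, width=0.01, amplitude=5.0):
    """Stage 1, the source: a Gaussian pulse repeating every `period` seconds."""
    phase = (t % period) - period / 2
    return amplitude * np.exp(-0.5 * (phase / width) ** 2)

def disperse(intensity, freqs_mhz, t, dm=50.0):
    """Stage 2, the interstellar medium: dispersion delays lower
    frequencies more (delay proportional to DM / f^2)."""
    k_dm = 4.149e3  # dispersion constant in s MHz^2 / (pc cm^-3)
    dyn_spec = np.empty((t.size, freqs_mhz.size))
    for j, f in enumerate(freqs_mhz):
        delay = k_dm * dm / f**2
        dyn_spec[:, j] = np.interp(t - delay, t, intensity, left=0.0, right=0.0)
    return dyn_spec

def observe(dyn_spec, noise_sigma=1.0, rfi_prob=0.05, rng=None):
    """Stage 3, the telescope: add receiver noise plus occasional
    narrowband human-made RFI in random frequency channels."""
    rng = rng or np.random.default_rng()
    data = dyn_spec + rng.normal(0.0, noise_sigma, dyn_spec.shape)
    rfi_channels = rng.random(dyn_spec.shape[1]) < rfi_prob
    data[:, rfi_channels] += 20.0  # strong persistent interference
    return data

# One synthetic observation: 2 s at 2 kHz sampling, 64 channels around 1.4 GHz
t = np.linspace(0.0, 2.0, 4000)
freqs = np.linspace(1200.0, 1600.0, 64)
dyn_spec = observe(disperse(pulsar_signal(t), freqs, t))
print(dyn_spec.shape)  # (4000, 64): this can now be cut into labelled frames
```

The frequency-dependent dispersion delay modelled in the second stage is, in real pipelines, the classic signature that distinguishes a genuine pulsar pulse from broadband RFI.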

The first version of the ML-PPA framework has been released and successfully tested.

We are developing one of the first ever machine-learning-based expert systems in radio astronomy, not only because we hope it will lead us to amazing scientific breakthroughs, but also because we have no other choice, and in many respects it is a process of searching for clues in largely unknown dark territory. Luckily, interTwin provides us with a roadmap.

Yurii Pidopryhora, scientist, Max Planck Institute for Radio Astronomy

General outline of the digital twin structure in the ML-PPA framework: modelling of the astrophysical source (a pulsar), transmission of the signal through the interstellar medium, and reception and processing by a radio telescope, with added sources of both natural and artificial interference and noise.

Cover image: adapted from MeerKAT – SARAO “MeerKAT radio telescope, South Africa”