In this talk, I presented a paper whose goal is to implement, in an actual neural-operator architecture, the probabilistic and geometric requirements of diffusion models, which become fundamentally different in a Hilbert-space setting.
The starting point is that the score is the logarithmic derivative along Cameron–Martin directions, and therefore it naturally lives in a specific Cameron–Martin space, namely the RKHS induced by the covariance. This observation essentially fixes the design principle: the score network should be constructed so that its output remains in that RKHS. A kernel neural operator (KNO) is a natural candidate for enforcing this. However, the kernel-based operator blocks used in standard KNOs suffer from rapidly growing computation and memory costs as the resolution increases, and they are also difficult to interpret from a principled standpoint.
To address these issues, the paper proposes a Galerkin-type KNO. The key idea is to approximate the kernel by a finite-dimensional feature inner product and to rearrange the core computations accordingly. This immediately raises a practical question: which features should be used to approximate the kernel? The authors argue that the kernel should be allowed to change its shape over diffusion time. They therefore parameterize a GSM-family kernel via a time-conditioned hypernetwork, and, recognizing that a single approximation mechanism is insufficient, they combine random Fourier features (RFF) with polynomial-based features. Crucially, this is not a naive concatenation: they learn a time-dependent gate that controls the relative weight of the two feature blocks, so the model can select the kernel characteristics needed at each time while keeping a fixed feature budget. When this Galerkin-type KNO is used as the operator block of the Hilbert-diffusion score network, each block takes the current function values, the coordinates, and the diffusion time as input and updates the function values through a linearized kernel-integral update.
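To make the rearrangement concrete, here is a minimal NumPy sketch of a gated two-block feature map and the resulting linearized kernel-integral update. It is an illustration under stated assumptions, not the paper's implementation: the sigmoid gate, its parameters `gate_w` and `gate_b`, the Legendre basis, the fixed Gaussian frequencies (standing in for the time-conditioned hypernetwork), and the uniform quadrature weights are all hypothetical choices made for this example.

```python
import numpy as np

def rff_features(x, omegas):
    """Paired cos/sin random Fourier features, shape (N, 2 * D)."""
    proj = np.outer(x, omegas)
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=1) / np.sqrt(len(omegas))

def poly_features(x, degree):
    """Legendre polynomials P_0..P_degree, with [0, 1] mapped to [-1, 1]."""
    return np.polynomial.legendre.legvander(2.0 * x - 1.0, degree)

def gated_features(x, t, omegas, degree, gate_w=4.0, gate_b=-2.0):
    """Weight the two feature blocks by a time-dependent gate g(t) in (0, 1).

    The sigmoid gate is a hypothetical stand-in for the learned gate.
    The implied kernel is g * k_rff + (1 - g) * k_poly.
    """
    g = 1.0 / (1.0 + np.exp(-(gate_w * t + gate_b)))
    return np.concatenate(
        [np.sqrt(g) * rff_features(x, omegas),
         np.sqrt(1.0 - g) * poly_features(x, degree)],
        axis=1,
    )

def kernel_integral_linear(phi, v, quad_w):
    """Compute sum_j k(x_i, x_j) w_j v_j as phi @ (phi^T (w * v)).

    The N x N kernel matrix is never formed; only N x m products are used.
    """
    return phi @ (phi.T @ (quad_w[:, None] * v))

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, size=64))   # random grid on [0, 1]
v = rng.standard_normal((64, 3))              # function values, 3 channels
quad_w = np.full(64, 1.0 / 64)                # uniform quadrature weights (assumption)
omegas = rng.normal(scale=10.0, size=16)      # fixed frequencies (assumption)
phi = gated_features(x, t=0.5, omegas=omegas, degree=4)
out = kernel_integral_linear(phi, v, quad_w)
```

Scaling the blocks by the square roots of `g` and `1 - g` keeps the total feature count fixed while letting the implied kernel interpolate between the two blocks as `t` changes.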
The training objective follows the standard denoising paradigm (noise prediction) used in Euclidean diffusion models, but it reflects the fact that on random grids the noise is not i.i.d. Gaussian; it is a correlated Gaussian determined by the kernel covariance. This creates a major bottleneck: it is infeasible to factorize the covariance matrix each time the grid changes in order to sample Hilbert noise exactly. The paper resolves this with a Nyström approximation. It computes the covariance structure once on a fixed landmark grid, and then transfers that structure to arbitrary random grids via cross-kernels, enabling efficient generation of covariance-matched noise without repeated large-scale decompositions.
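The Nyström idea can be sketched as follows. This is an assumed minimal version, not the paper's code: the RBF kernel, its lengthscale, the jitter value, and the class name `NystromNoise` are illustrative choices. The landmark covariance is factorized once; each new grid only needs a cross-kernel and a triangular solve.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=0.2):
    """Squared-exponential kernel, used here only as an example covariance."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

class NystromNoise:
    """Sample approximately kernel-correlated Gaussian noise on arbitrary grids."""

    def __init__(self, landmarks, kernel, jitter=1e-6):
        self.z = landmarks
        self.kernel = kernel
        K_zz = kernel(landmarks, landmarks) + jitter * np.eye(len(landmarks))
        self.L = np.linalg.cholesky(K_zz)   # factorized once, reused for every grid

    def sample(self, x, rng):
        K_xz = self.kernel(x, self.z)       # cross-kernel to the new grid
        xi = rng.standard_normal(len(self.z))
        # Solving L^T a = xi gives Cov(K_xz a) = K_xz K_zz^{-1} K_zx,
        # which is the Nystrom approximation of K_xx.
        a = np.linalg.solve(self.L.T, xi)
        return K_xz @ a

rng = np.random.default_rng(0)
sampler = NystromNoise(np.linspace(0.0, 1.0, 32), rbf_kernel)
x = np.sort(rng.uniform(0.0, 1.0, size=100))   # a fresh random grid
eps = sampler.sample(x, rng)
```

The per-grid cost depends on the number of landmarks rather than on a new factorization of the full covariance on each grid.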
Empirically, the authors compare models on several one-dimensional synthetic function families, training on random grids and evaluating on a common uniform grid. Generative quality is assessed using two-sample-test-based metrics and the first-moment error. The results indicate that, compared to FNO, the Galerkin-type KNO improves both generation quality on random grids and computational efficiency while maintaining RKHS consistency.
The most convincing part of the paper is the theoretical interpretation developed in the later sections. Rather than treating the kernel-integral update as merely a “plausible attention variant,” the authors reformulate it from the perspective of Galerkin approximation of the solution to a regularized regression problem with weighted observation error and an RKHS penalty. Under this view, the fact that the network operates with a fixed feature budget becomes the statement that it performs a Galerkin approximation to kernel ridge regression on a finite-dimensional subspace of the RKHS, which enables analysis of stability and approximation error. Moreover, the implementation choice of avoiding an explicit matrix inverse can be interpreted as a one-step unrolling of an iterative method in feature space, allowing the resulting error reduction to be quantified.
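One way to see this interpretation in code is the sketch below. It sets up the weighted ridge problem in feature space and compares the exact Galerkin solution with a single inverse-free step. The summary does not say which iterative method the paper unrolls, so a Richardson iteration started from zero is used here as a stand-in; the Legendre features, the weights, and the value of `lam` are likewise assumptions.

```python
import numpy as np

def normal_equations(phi, y, w, lam):
    """Galerkin system for min_c sum_j w_j (y_j - phi_j^T c)^2 + lam * |c|^2.

    For the kernel phi(x)^T phi(y), lam * |c|^2 plays the role of the RKHS penalty.
    """
    A = phi.T @ (w[:, None] * phi) + lam * np.eye(phi.shape[1])
    b = phi.T @ (w * y)
    return A, b

def richardson_step(c, A, b, alpha):
    """One step of c <- c + alpha * (b - A c); no matrix inverse is needed."""
    return c + alpha * (b - A @ c)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 80)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(80)
w = np.full(80, 1.0 / 80)                               # observation weights (assumption)
phi = np.polynomial.legendre.legvander(2 * x - 1, 6)    # stand-in feature map

A, b = normal_equations(phi, y, w, lam=1e-2)
c_star = np.linalg.solve(A, b)          # exact Galerkin KRR coefficients
eigs = np.linalg.eigvalsh(A)            # ascending eigenvalues of A
alpha = 1.0 / eigs[-1]                  # step size 1 / lambda_max
rho = 1.0 - eigs[0] / eigs[-1]          # contraction factor of one step
c_one = richardson_step(np.zeros_like(b), A, b, alpha)
```

Starting from zero, the single step gives `c_one = alpha * phi.T @ (w * y)`, so the prediction `phi @ c_one` is a scaled kernel-integral update, and the distance to `c_star` shrinks by at most the factor `rho`.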
Q: Does implementing random grids via the Nyström approximation make the model “more resolution-free”?
A: No. The Nyström approximation does not change the model’s intrinsic resolution-free property. The resolution-free behavior is primarily determined by the operator architecture.
Q: All training was conducted on random grids, but since there is no comparative experiment on a uniform grid, it is hard to assess how effective the proposed Nyström approximation actually is. What are the experimental results on a uniform grid?
A: In our implementation, a random grid is handled by directly providing the model with the coordinate information of the function. However, this design choice by itself does not fundamentally change the resolution-free property that is built into the model. As a result, it is not straightforward to disentangle whether the strong performance comes from supplying the model with additional “coordinate information,” or from the fact that the model is exposed to a wider variety of coordinate sets during training. Resolving this requires additional experiments.
Q: When designing the explicit feature map, why did you use both RFF and polynomial features? Isn’t a single approach sufficient?
A: We use both because they are complementary approximations of the target kernel class, and neither one alone is reliable across the regimes the model must handle.
RFF is most effective for the shift-invariant component of the kernel, which is well represented through its spectral density; in that regime, RFF gives an accurate approximation. Polynomial-based features address a different failure mode: on bounded domains and in settings where boundary effects, nonstationary behavior, or local structure matters, a purely spectral approximation can be inefficient or biased unless the feature budget is made very large.
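The structural difference between the two feature families can be checked with a small sketch. It only illustrates the stationarity argument and does not reproduce any experiment from the paper; the frequency scale, the polynomial degree, and the Legendre basis are assumptions made for this example.

```python
import numpy as np

def rff_kernel(x, y, omegas):
    """Kernel implied by paired cos/sin features: mean_k cos(omega_k * (x - y))."""
    fx = np.concatenate([np.cos(np.outer(x, omegas)), np.sin(np.outer(x, omegas))], axis=1)
    fy = np.concatenate([np.cos(np.outer(y, omegas)), np.sin(np.outer(y, omegas))], axis=1)
    return fx @ fy.T / len(omegas)

def legendre_kernel(x, y, degree):
    """Kernel implied by Legendre features on [0, 1] mapped to [-1, 1]."""
    px = np.polynomial.legendre.legvander(2.0 * x - 1.0, degree)
    py = np.polynomial.legendre.legvander(2.0 * y - 1.0, degree)
    return px @ py.T

x = np.linspace(0.0, 1.0, 51)
omegas = np.random.default_rng(0).normal(scale=5.0, size=32)

k_rff = rff_kernel(x, x, omegas)
k_rff_shifted = rff_kernel(x + 0.3, x + 0.3, omegas)   # same kernel after a shift
diag_rff = np.diag(k_rff)                              # constant along the domain
diag_poly = np.diag(legendre_kernel(x, x, 4))          # varies, largest at the boundary
```

The RFF kernel depends only on `x - y`, so it cannot express position-dependent structure by itself, whereas the polynomial kernel varies across the domain and is largest at the endpoints.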