In generative modeling, diffusion models and flow matching have demonstrated exceptional performance at modeling data distributions. However, they share a fundamental limitation: inference requires many iterative neural function evaluations (NFEs). To overcome this, a line of research has pursued one-step generation, either by distilling multi-step models or by approximating the trajectories of ordinary/stochastic differential equations (ODEs/SDEs). In contrast, this paper proposes "Drifting Models," a conceptually different paradigm. The core idea is to shift the iterative sample evolution, traditionally performed at inference time, to training time, where deep learning optimization already proceeds iteratively.
Fundamentally, generative modeling can be viewed as learning a pushforward operation that maps a prior distribution, such as random noise, to the data distribution. While existing models decompose a complex pushforward map into multiple steps applied progressively during inference, the proposed method evolves the pushforward distribution itself—generated by a single-pass neural network—so that it aligns with the target data distribution through iterative weight updates during training.
To mathematically derive this distribution evolution, the paper introduces a novel concept called a "drifting field." This field acts as a vector field that determines the direction and magnitude of the drift for individual samples, based on the discrepancy between the generated and real data distributions. Crucially, this field is rigorously designed to exhibit anti-symmetry, ensuring that it reaches an equilibrium where all sample movement halts when the two distributions match perfectly.
Specifically, to allow computation via empirical samples even without perfect knowledge of the target distribution, the drifting field is constructed from the interaction between attraction driven by real data (positive samples) and repulsion driven by generated data (negative samples). This is calculated as the difference between weighted average vectors using a kernel function, inspired by classical data clustering techniques like the mean-shift algorithm and contrastive learning.
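The attraction-minus-repulsion construction described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the kernel form exp(-||x - y|| / tau), the temperature `tau`, and the function names `kernel_weights` and `drifting_field` are assumptions made for the example; the summary only states that the kernel is based on the L_2 distance.

```python
import numpy as np

def kernel_weights(x, y, tau=1.0):
    """Row-normalized kernel weights between x (N, D) and y (M, D).

    Assumption: k(x, y) = exp(-||x - y||_2 / tau). The exact kernel and
    normalization used in the paper may differ.
    """
    dist = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)  # (N, M)
    k = np.exp(-dist / tau)
    return k / k.sum(axis=1, keepdims=True)  # each row sums to 1

def drifting_field(x, y_pos, y_neg, tau=1.0):
    """Attraction toward y_pos minus repulsion from y_neg, evaluated at x."""
    attraction = kernel_weights(x, y_pos, tau) @ y_pos - x  # weighted mean of (y+ - x)
    repulsion = kernel_weights(x, y_neg, tau) @ y_neg - x   # weighted mean of (y- - x)
    return attraction - repulsion

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 2))                 # query points
real = rng.normal(loc=2.0, size=(32, 2))    # stand-in for real data
fake = rng.normal(loc=-1.0, size=(16, 2))   # stand-in for generated data

v = drifting_field(x, real, fake)
v_swapped = drifting_field(x, fake, real)   # roles of the two sets exchanged
v_same = drifting_field(x, real, real)      # identical positive and negative sets
```

In this sketch, swapping the two sample sets flips the sign of the field (`v_swapped` equals `-v`), and using the same set on both sides gives a zero field, which is the anti-symmetry and equilibrium behavior the text describes.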
The training objective is formulated as a simple Mean Squared Error (MSE) loss, guiding the samples generated by the current neural network to move directly toward the target points computed by this drifting field. A stop-gradient operation is applied to these target points, allowing the neural network to track the direction of sample evolution and achieve stable optimization without relying on complex adversarial training (GANs) or integral equations.
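To make the role of the stop-gradient concrete, the following sketch evaluates the loss for a tiny batch and writes out its gradient by hand. It is an assumption-laden illustration: the drift `v` is a made-up array rather than the output of a kernel computation, the gradient is taken with respect to the samples themselves instead of network weights, and `drifting_loss_and_grad` is a hypothetical helper name.

```python
import numpy as np

def drifting_loss_and_grad(x, v):
    """MSE between x and the frozen target x + v.

    Returns (loss, dloss/dx). The target is treated as a constant, which is
    what the stop-gradient does in an autodiff framework.
    """
    target = x + v                      # frozen: no gradient flows through it
    diff = x - target                   # equals -v
    loss = np.mean(np.sum(diff ** 2, axis=1))
    grad = 2.0 * diff / x.shape[0]      # derivative of the batch mean w.r.t. x
    return loss, grad

x = np.array([[0.0, 0.0], [1.0, 1.0]])      # two generated samples
v = np.array([[0.5, 0.0], [0.0, -1.0]])     # illustrative drift vectors
loss, grad = drifting_loss_and_grad(x, v)

# A gradient-descent step on the samples; this step size lands exactly on the target.
step_size = 0.5 * x.shape[0]
x_next = x - step_size * grad
```

Because the target is frozen, the gradient points along `-v`, so descending the loss moves each sample in the direction of its drift vector. The loss value itself equals the mean squared norm of the drift.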
To drastically enhance the generation quality of high-dimensional data such as high-resolution images, the authors opted to compute this drifting loss in a pre-trained feature space rather than the raw pixel space. By extracting multi-scale spatial feature maps using multi-stage encoders like ResNets or self-supervised Masked Autoencoders (MAEs), and computing the sum of kernel-based sample similarities and drifts across various scales and locations, they provide the neural network with morphologically and semantically rich training signals.
Furthermore, Classifier-Free Guidance (CFG), a cornerstone of modern image generation, is integrated into training rather than applied at inference time. The CFG formula, which linearly combines conditional and unconditional distributions, is incorporated directly into the sampling weights of negative samples within the training batch. As a result, the model generates images with the CFG effect applied using a single function evaluation (1-NFE) at inference, with no additional unconditional network evaluation.
In large-scale experiments on ImageNet at 256×256 resolution, the proposed Drifting Models achieved an FID (Fréchet Inception Distance) of 1.54 for latent space generation and 1.61 for direct pixel space generation. These results establish a new state of the art among one-step generators, outperforming existing distillation-based and non-distillation-based one-step models. Despite the relatively small model size, these figures are also competitive with large diffusion models that require dozens of sampling steps.
In conclusion, this research departs from the paradigm of approximating differential-equation trajectories on which diffusion models and flow matching are built. Instead, it reinterprets the iterative gradient-based optimization of a neural network as a mechanism for progressively evolving a probability distribution. By reducing inference to a single network evaluation while maintaining competitive generation quality, the approach offers an original alternative for designing generative models, with potential applications spanning visual intelligence and control systems.
Q1. Equation (11) defines the drifting field. Looking at the structure of the equation, it seems that we need to calculate pairs for x and y+, x and y-, and y+ and y-, as well as compute a double expectation. Doesn't this cause excessive computational complexity?
A1. No, it does not cause excessive computational complexity. By understanding the difference between the mathematical formulation of Equation (11) and its actual implementation in code, we can see that the computational cost does not burden the training process.
Equation (11) is the result of combining and expanding Equations (8) and (10). However, the actual implementation in Algorithm 2 (Appendix A.1) only calculates the distances between x (generated samples) and y+ (real data), and between x and y- (other generated samples).
When the weighted averages of (y+ - x) and (y- - x) from Equation (8) are subtracted as in Equation (10), the common -x term cancels. This leaves only the difference between the weighted averages of y+ and y-. Therefore, there is no need to compute distances or interactions directly between y+ and y-.
The double expectation is handled via simple matrix multiplication. The paper approximates it using empirical means within a mini-batch. As shown in Algorithm 2, it is computed in one shot by multiplying the weight matrices derived from kernel similarities with the sample matrices, an operation that is fast on GPUs.
The cost of these distance calculations scales as O(N × N_{pos} × D) and O(N × N_{neg} × D). Compared to the forward passes of the heavily parameterized Generator (DiT) and the multi-scale Feature Encoder, these matrix operations are cheap. The authors confirm this in Appendix A.5, stating: "once the feature encoder is run, the computational cost of our drifting loss is negligible."
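The cancellation argument in this answer can be checked numerically. The sketch below compares a two-matrix-multiplication form against an explicit double sum over (y+, y-) pairs. It assumes row-normalized kernel weights of the form exp(-||x - y|| / tau); the function name `normalized_weights` and the batch sizes are illustrative, and the code is not taken from Algorithm 2.

```python
import numpy as np

def normalized_weights(x, y, tau=1.0):
    """Kernel weights exp(-||x - y|| / tau), normalized over y for each x (assumed form)."""
    dist = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    k = np.exp(-dist / tau)
    return k / k.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))        # N = 4 generated samples, D = 3
y_pos = rng.normal(size=(5, 3))    # N_pos = 5 real samples
y_neg = rng.normal(size=(6, 3))    # N_neg = 6 generated samples used as negatives

w_pos = normalized_weights(x, y_pos)   # (N, N_pos), needs only x-to-y+ distances
w_neg = normalized_weights(x, y_neg)   # (N, N_neg), needs only x-to-y- distances

# Fast form: two matrix multiplications, no y+ to y- interaction.
v_fast = w_pos @ y_pos - w_neg @ y_neg

# Explicit form: double sum over every (y+, y-) pair, weighted by both kernels.
pair_diff = y_pos[:, None, :] - y_neg[None, :, :]            # (N_pos, N_neg, D)
v_pairs = np.einsum("ip,iq,pqd->id", w_pos, w_neg, pair_diff)
```

Both forms agree up to floating-point error because each row of weights sums to one, so the pairwise sum factorizes. The fast form never materializes the (N_pos, N_neg, D) tensor.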
Q2. The proposed Drifting field V in the paper seems to play the same role as the score function in score-based models. However, V is not a target of learning. Specifically, what are the similarities and differences between V and the Score function?
A2. The Drifting field V and the Score function in score-based models share a conceptual root: they both act as a "vector field that moves samples toward the true data distribution." However, due to the paradigm shift proposed in this paper, they differ fundamentally in what the neural network learns, when the iterative movement occurs, and how they are formulated. Here are the specific similarities and differences.
Both concepts serve as vectors in a high-dimensional space indicating the direction and magnitude by which a sample (noise or intermediate data) should move to align with the target data distribution.
The most critical differences lie in "what learns what?" and "when does the iterative movement happen?"
1. Target of Learning vs. Supervisory Signal
Score Function: It is the gradient of the log probability density of perturbed data. In score-based models, the neural network is trained to approximate this score function, so the network's output is itself the vector field.
Drifting Field V: It is not a learned target. V is a non-parametric mathematical formulation calculated explicitly during a training batch via kernel-based interactions (attraction and repulsion) between generated samples and real samples. V serves as the supervisory signal (the target in the loss function) that guides the Generator network to update its weights and move the samples to the correct positions.
2. Timing of Iteration: Inference Time vs. Training Time
Score Function: After training, it is applied iteratively (tens to hundreds of times) during inference using numerical ODE solvers or Langevin dynamics to progressively denoise a sample.
Drifting Field V: It is applied iteratively during the training process (e.g., via SGD or Adam). At each training iteration, V is computed to slightly evolve the generator's pushforward distribution. As a result, once training is complete, the generator produces images in a single pass (1-NFE) at inference, and V is not used at all at inference time.
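The contrast in timing can be illustrated with a toy one-dimensional generator trained by drifting. Everything here is an assumption made for the sketch: the affine generator g(eps) = a * eps + b, the Gaussian toy data, the kernel exp(-|x - y| / tau), the use of the generated batch itself as negatives, and the hand-written gradient. It is not the paper's training recipe.

```python
import numpy as np

def weights(x, y, tau=1.0):
    """Row-normalized kernel weights exp(-||x - y|| / tau) (assumed kernel form)."""
    dist = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    k = np.exp(-dist / tau)
    return k / k.sum(axis=1, keepdims=True)

def drifting_field(x, y_pos, y_neg, tau=1.0):
    """Attraction toward y_pos minus repulsion from y_neg."""
    return weights(x, y_pos, tau) @ y_pos - weights(x, y_neg, tau) @ y_neg

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=0.5, size=(512, 1))   # toy "real" data

a, b = 1.0, 0.0        # parameters of the toy generator g(eps) = a * eps + b
lr = 0.1

# Training time: the drift is recomputed at every iteration.
for step in range(300):
    eps = rng.normal(size=(128, 1))
    x = a * eps + b
    y_pos = data[rng.choice(len(data), size=128, replace=False)]
    v = drifting_field(x, y_pos, x)      # negatives: the generated batch itself
    # loss = mean((x - stopgrad(x + v))**2), so dloss/dx = -2 * v / N
    grad_x = -2.0 * v / len(x)
    a -= lr * float(np.sum(grad_x * eps))   # chain rule: dx/da = eps
    b -= lr * float(np.sum(grad_x))         # chain rule: dx/db = 1

# Inference time: one pass through the generator, no drift computation.
sample = a * rng.normal(size=(1000, 1)) + b
```

The drift appears only inside the training loop, where it defines the regression target. Sampling afterwards is a single evaluation of the generator, in contrast to a score-based sampler that would call the learned field repeatedly.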
Q3. The paper strongly emphasizes computing the Drifting Loss in a pre-trained "Feature Space" rather than the raw data space (pixel or latent space), even noting that the model fails on complex datasets like ImageNet without a feature encoder. Why does the feature space play such an indispensable role in this model?
A3. The core mechanism of Drifting Models relies on a kernel function k(x, y) to measure similarities between generated samples x and real data y+, which dictates the pull of the drifting field. The paper defines this kernel based on the L_2 distance (Equation 12).
However, in high-dimensional raw pixel spaces or simply compressed latent spaces, the numerical difference in pixel values does not accurately reflect semantic similarity. In raw space, the distance between two different cat images might be just as large as the distance between a cat and a car. Consequently, all samples appear far apart, making the kernel "flat." When this happens, the drifting field V vanishes, failing to provide meaningful gradient signals for training (as mentioned in Section 5.2).
In contrast, feature encoders pre-trained via Self-Supervised Learning (SSL, e.g., MAE, MoCo) are designed to map semantically similar samples close together in the feature space. By computing distances and kernels in this space, a generated sample receives a stronger, more informative signal to drift toward the real data that is semantically closest to it.
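The claim that raw high-dimensional distances carry little contrast can be illustrated with a generic effect known as distance concentration. The sketch below uses random Gaussian vectors, not images, and is not an experiment from the paper; the dimension 3072 is chosen only because it matches a 32×32×3 pixel array, and `relative_spread` is a hypothetical helper.

```python
import numpy as np

def relative_spread(dim, n=200, seed=0):
    """std / mean of pairwise L2 distances between n standard-normal points."""
    rng = np.random.default_rng(seed)
    pts = rng.normal(size=(n, dim))
    sq = np.sum(pts ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * pts @ pts.T   # squared distances
    iu = np.triu_indices(n, k=1)                          # each pair once, no self-pairs
    dist = np.sqrt(np.maximum(d2[iu], 0.0))               # clip tiny negatives from rounding
    return dist.std() / dist.mean()

low = relative_spread(dim=2)       # low-dimensional space
high = relative_spread(dim=3072)   # raw-pixel-sized space
```

The relative spread is much smaller in the high-dimensional case: all pairs look roughly equally far apart. When a kernel temperature is set on the scale of the typical distance, the resulting weights are then close to uniform, which is one way to read the "flat kernel" described above.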
Q4. In the toy experiments (Section 5.1) of the paper, even when the generated distribution q is poorly initialized in a collapsed, single-mode state, it successfully discovers all modes of the target distribution p. How do Drifting Models overcome the "mode collapse" problem that commonly affects generative models such as GANs, and specifically, what roles do the "Repulsion" term in Equation (10) and "Anti-symmetry" play in this process?
A4. Drifting Models are robust against mode collapse because their objective models "distribution-to-distribution interactions" rather than single-sample trajectories. The attraction and repulsion terms drive this behavior, and anti-symmetry ensures that the drift vanishes when the generated and real distributions match, so a matched distribution is an equilibrium.
According to Equation (10) in the paper, the drifting field is formed by the attraction of real data minus the repulsion of generated data.
Suppose the generator suffers from mode collapse, so that all generated samples x cluster tightly in a single, narrow region. The distances between generated samples then become very small, so the kernel values k(x, y^-) between them approach their largest values. Each sample's repulsion term is therefore dominated by its immediate neighbors and points away from the local concentration of generated samples. This repulsion works against the samples staying clustered and pushes them outward, toward the broader space.
As the samples are pushed outward and leave the collapsed state, they also feel the attraction of the real data. Near the initially collapsed mode, repulsion largely offsets attraction. In modes of the real data that the generator has not yet reached (empty modes), however, there are few or no generated samples nearby to create repulsion, so attraction there is essentially unopposed. As a result, the dispersed samples are pulled into these undiscovered modes of the target distribution.
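The first step of this process can be examined on a toy example. The sketch below places a collapsed batch on one of two data modes and evaluates attraction and repulsion separately. It assumes row-normalized weights with kernel exp(-|x - y| / tau) and an arbitrary temperature tau = 2; it shows only the initial drift, and does not demonstrate that a full training run recovers every mode.

```python
import numpy as np

def weights(x, y, tau):
    """Row-normalized kernel weights exp(-||x - y|| / tau) (assumed kernel form)."""
    dist = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    k = np.exp(-dist / tau)
    return k / k.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
modes = np.array([[-2.0], [2.0]])
data = np.concatenate([m + 0.3 * rng.normal(size=(256, 1)) for m in modes])

# Collapsed initialization: every generated sample sits on the right-hand mode.
collapsed = 2.0 + 0.05 * rng.normal(size=(128, 1))
tau = 2.0

attraction = weights(collapsed, data, tau) @ data - collapsed
repulsion = weights(collapsed, collapsed, tau) @ collapsed - collapsed
v = attraction - repulsion

mean_drift = float(v.mean())   # net direction of the whole batch
# Alignment of the repulsive push (-repulsion) with each sample's offset from the cluster mean.
outward = float(np.sum(-repulsion * (collapsed - collapsed.mean())))
```

In this sketch the repulsive push is aligned with each sample's offset from the cluster mean (`outward` is positive), so it spreads the cluster, while the batch-averaged drift is negative, meaning the attraction pulls the batch toward the uncovered mode at -2.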