Abstract visual reasoning based on algebraic methods

Mingyang Zheng^1 (equal contribution), Weibing Wan^1 (equal contribution) & Zhijun Fang^2

Scientific Reports, volume 15, Article number: 3482 (2025). Article, Open access. Published: 28 January 2025.

Subjects: Computer science; Information technology

Abstract

Extracting high-order abstract patterns from complex high-dimensional data forms the foundation of human cognitive abilities. Abstract visual reasoning involves identifying abstract patterns embedded within composite images and is considered a core competency of machine intelligence. Traditional neuro-symbolic methods often infer unknown objects through data fitting, without fully exploiting the abstract patterns within composite images or the sequential sensitivity of visual sequences. This paper constructs a relation model with object-centric inductive biases that learns end-to-end, multi-granular rule embeddings at different levels. Through a gating fusion module, the model incrementally integrates explicit representations of objects and abstract relationships. The model incorporates a relational bottleneck method from information theory, separating the input perceptual information from the embeddings of abstract representations, thereby restricting and differentiating feature processing to encourage relational comparisons and induce the extraction of abstract patterns. Furthermore, this paper bridges algebraic operations and machine reasoning through the relational bottleneck method, extracting common patterns of multiple visual objects by identifying invariant sequences within the relational bottleneck matrix. Experimental results on the I-RAVEN dataset demonstrate a total accuracy of 96.8%, surpassing state-of-the-art baseline methods and exceeding human performance of 84.4%.

Introduction

The extraction of low-dimensional abstract features from complex high-dimensional data constitutes the essence of human intelligence. Humans perceive the world by discerning individual objects, relationships between objects, and higher-order abstract patterns. Mimicking this human capability stands at the core of artificial intelligence research. Abstract visual reasoning entails uncovering abstract patterns embedded within composite images and extending them to novel inputs, showcasing human prowess in abstract generalization.
Raven's Progressive Matrices (RPM) are widely acknowledged by researchers as a tool for evaluating the reasoning capabilities of visual models^1,2.

Figure 1: Examples from the I-RAVEN dataset, with the correct answers highlighted in red boxes in the answer set. The model is required to identify relationships within specific domains (such as shape or color) in the source sequence and apply them to different domains to correctly complete the target sequence.

The Raven's Progressive Matrices test is a multiple-choice intelligence test, depicted in Fig. 1, that asks participants to fill in the missing element of a 3x3 image grid. During the evaluation, reasoners must abstract the visual input representations to uncover higher-order abstract relationships within the composite images. RPM is used to assess abstract language, spatial, and mathematical reasoning abilities, and the test also reveals significant variance within populations with higher levels of education^3. The ability to solve RPM puzzles is regarded by cognitive scientists as a paradigm of fluid intelligence, characterized by the acute identification of new relationships, abstraction, and flexible thinking^4. The term "fluid" refers to the capacity to discover new associations and demonstrate agile abstract thinking, applying newly abstracted patterns to previously unencountered problems. Abstract reasoning abilities are widely acknowledged as hallmarks of human intelligence, showcasing cognitive flexibility and adaptability^5,6. The method proposed in this paper leverages RPM tests to enhance the model's visual reasoning capabilities and elevate its level of cognitive intelligence.

Constructing neural networks and applying reinforcement learning methods to solve RPM problems is common practice among researchers^7,8,9,10. Most approaches use neural networks to extract features from the original RPM images, extract abstract relationships from low-level perceptual features, and select answers by measuring feature similarity. During reasoning tests, researchers have identified shortcut biases, prompting the introduction of improved benchmarks such as I-RAVEN^11 and RAVEN-FAIR^12 to address them. These new test datasets impose higher demands on the model's abstract perceptual capabilities.

Our model establishes a bridge between algebraic operations and machine reasoning through the relational bottleneck method, explicitly transforming multi-visual reasoning problems into 0-1 relational bottleneck matrices, and identifies system invariance through the comparison and observation of sequential features. The model integrates object-centric representations and the relational bottleneck method to construct a reasoning framework with strong inductive bias, demonstrating strong visual reasoning capabilities on the test datasets.

The main contributions of this paper are as follows:

* Using the relational bottleneck method, we construct a bridge between algebraic operations and machine reasoning, explicitly converting multi-visual reasoning problems into 0-1 relational bottleneck matrices and observing system invariance through the comparison of sequence features.
* By integrating object-centric representations with the relational bottleneck method, we establish a reasoning framework with strong inductive bias, constraining and differentiating feature information to encourage relational comparisons and thereby induce the extraction of abstract patterns.

* We introduce bidirectional reasoning involving top-down and bottom-up processes, aggregating different representations to build a feedback mechanism that simulates human thinking processes.

Related work

Abstract visual reasoning

Traditional neuro-symbolic approaches have demonstrated good performance in tasks involving visual feature recognition and causal relationship extraction. Li et al.^13 integrated perception, analysis, and logical symbols into a unified reasoning framework to extract abstract patterns of composite targets through closed-loop logical reasoning. Hu et al.^11 proposed the Stratified Rule-Aware Network (SRAN) to generate rule embeddings for input sequences. SRAN effectively integrates key inductive biases, such as sequence sensitivity, permutation invariance, and incremental rule induction, by learning multi-granularity rule embeddings at different levels. Taking two rows/columns as input, it progressively learns and integrates hierarchical rule embeddings at the unit, entity, and system levels. These multi-granularity embeddings are gradually fused through a gating fusion module, naturally preserving sequence sensitivity and effectively mapping inputs into the rule embedding space. However, such neural networks do not fully induce abstract representations of the underlying rules but tend to overfit visual features, and therefore fail to generalize to new inputs that follow the same patterns^14.

Traditional neuro-symbolic methods are efficient at extracting features and capturing relationships between features, yet complex reasoning demands the extraction of higher-order abstract relationships within the problem. Object-centric relational representations, on the other hand, enable the acquisition of more advanced abstract logic^15. Abstract visual patterns are typically characterized by relationships between objects rather than the features of the objects themselves. Learning object-centric representations of complex scenes is a promising step towards efficient abstract reasoning from low-level perceptual features^16: a model explicitly represents objects and their abstract relationships, combining slot-based object representations with transformer-based architectures^17,18. Object-centric learning seeks to achieve a general and compositional understanding of scenes by representing them as collections of object-centric features, inducing the formation of abstract relationships through multi-level compositional reasoning^19.

Object-centric learning

Learning object-centric representations enables machines to perceive the visual world in a manner similar to humans^20, extracting object representations from images and training end-to-end through downstream tasks^16,21,22. This approach allows for data-driven inference of relationships and unsupervised prediction of structured object properties. Object-centric representation methods capture visual objects at different levels and granularities^23,24, but they are relatively fragile in handling changes unrelated to the visual task.
To mitigate this limitation, we integrate this approach with the relational bottleneck method^25 and establish a bridge between algebraic operations and machine reasoning.

Relational bottleneck method

Extracting abstract patterns from visual images fundamentally amounts to imposing a stronger inductive bias^26,27. Abstract visual patterns are typically characterized by relationships between objects, and the discovery of these abstract relationships naturally begins with the objects themselves. CoRelNet^28 is a typical model demonstrating the logic of relational bottlenecks. An encoder processes sensory observations to generate object embeddings and computes a relation matrix for pairs of objects, capturing inter-object similarities through inner products. This relation matrix is then fed into a decoder network, which can vary in architecture, for instance a multi-layer perceptron or a transformer. This establishes a relational bottleneck: perceptual information is channeled into the matrix and transformed into a form that retains only relational information before being transmitted to the decoder.

The relational bottleneck approach uses inner products to represent pairwise relationships, whereas traditional relational networks rely on task-specific, learned generic neural network components (such as multi-layer perceptrons). Although traditional methods are theoretically more flexible, their structures do not explicitly constrain the network to learn only relational information. Such architectures are prone to learning shortcuts that align too closely with the training data, impairing their ability to learn and generalize relationships to out-of-distribution inputs. Inner product operations are inherently relational^29,30,31, and the relational bottleneck method ensures that downstream processing is based solely on relationships. Similarity is the cornerstone of human reasoning, and any abstraction mechanism is primarily based on similarity. Our model adopts a data-driven approach to induce abstract relationship models by restricting information processing to focus mainly on relationships between visual objects^32, thereby promoting the emergence of abstraction mechanisms resembling similarity relationships. This method allows for rapid learning of relational patterns and systematic generalization.

Figure 2: The abstract reasoning framework consists of three core modules: (1) extraction of object-centric representations; (2) extraction of pairwise relationship embeddings using the bottleneck method; (3) representation of high-order abstract patterns through relational bottleneck matrices and querying of common features in sequences.

Methodology

Object-centric slot attention mechanism

We use the Slot Attention module to extract initial image features^16. The slot attention mechanism learns to highlight individual objects, accomplishing image segmentation in an unsupervised manner. It interacts with the output of the preceding convolutional neural network, generating task-specific abstract representations called slots. During the initialization phase, slots compete with each other, occupying attentional regions on a per-pixel basis. Finally, attention maps aggregate attention between each slot and pixel to achieve object-centric visual feature representation, enabling the adoption of data-driven inductive biases.
Trained on unsupervised object discovery and supervised attribute prediction tasks, Slot Attention can extract object-centric representations, enabling a comprehensive understanding of multiple input visual images, as depicted in Fig. 3.

Figure 3: Object-centric slot attention mechanism.

First, a convolutional neural network (CNN) encodes the input image to generate visual feature maps of dimension \(HW\times {D_{enc}}\), where \(H\) and \(W\) are the height and width of the input image. In a single iteration over a set of input features, the slot attention mechanism maps \(N\) input feature vectors through iterative attention to \(K\) output slots^16. The slots have dimensionality \({D_{slots}}\), \(slots\in {R^{K\times D}}\). They are initialized randomly and are bound to specific input attributes over \(T\) iterations. The slots possess a learnable mean and variance, denoted \(\mu \in {R^{{D_{slots}}}}\) and \({\sigma }^{2}\in {R^{{D_{slots}}}}\). Learnable linear transformations \(q\), \(k\), and \(v\) map the inputs and slots to a shared dimension \(D\):

$$\begin{aligned} attn_{i,j} := \frac{\exp (M_{i,j})}{\sum \nolimits _{l} \exp (M_{i,l})},\quad \text {where } M := \frac{1}{\sqrt{D}}\, k(inputs)\cdot q(slots)^{T} \in {R^{HW \times K}} \end{aligned}$$
(1)

To prevent the attention mechanism from overlooking input features, the attention coefficients of each input feature vector are constrained to sum to 1. To enhance the stability of the attention mechanism, we enforce this constraint through weighted averaging:

$$\begin{aligned} update := W^{T}\cdot v(inputs) \in {R^{K \times D_{slots}}},\quad \text {where } W_{i,j} := \frac{attn_{i,j}}{\sum \nolimits _{l = 1}^{N} attn_{l,j}} \end{aligned}$$
(2)

The slot attention mechanism controls the slot update through a hidden gated unit of size \({D_{slots}}\)^33, and improves output performance by transforming the gated unit with a multi-layer perceptron with residual connections and ReLU activation. In practice, we further add a small offset \(\delta\) to the attention coefficients to avoid numerical instability. An MLP with residual connections^34 acts independently on each slot, enabling the capture of features such as shape, color, and size of the input images through feature aggregation of the slot attention mechanism^35.

To alleviate the insensitivity of the slot attention mechanism to positional information, we introduce an enhanced positional encoding method. Specifically, by establishing the visual center of an image and building interactive relationships between each sub-image and that center, we obtain relative positional information for each sub-image. The accuracy of the positional information is verified through transformations applied to the visual center. This enhancement improves the model's perception of positional information and strengthens the correlation between position and features. The algorithm logic is given in Table 1.

Table 1: Slot attention module.

After the slot attention mechanism, we introduce a gating mechanism^36. In addition to segregating different representations into the relational bottleneck method, we also introduce a bidirectional reasoning mechanism.
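To make the iterative update of Eqs. (1)-(2) concrete, the following is a minimal sketch of a slot-attention step, assuming a PyTorch implementation; the class and parameter names (SlotAttentionSketch, num_slots, hidden_dim) are illustrative and are not taken from the authors' code.

```python
# Minimal slot-attention sketch (assumed PyTorch implementation, illustrative names).
import torch
import torch.nn as nn

class SlotAttentionSketch(nn.Module):
    def __init__(self, num_slots=9, dim=32, iters=3, eps=1e-8, hidden_dim=64):
        super().__init__()
        self.num_slots, self.iters, self.eps, self.scale = num_slots, iters, eps, dim ** -0.5
        # Learnable mean and (log) variance used to sample the initial slots.
        self.slots_mu = nn.Parameter(torch.randn(1, 1, dim))
        self.slots_logsigma = nn.Parameter(torch.zeros(1, 1, dim))
        # Learnable linear maps q, k, v projecting slots and inputs to a shared dimension D.
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.gru = nn.GRUCell(dim, dim)                        # gated slot update
        self.mlp = nn.Sequential(nn.Linear(dim, hidden_dim),   # residual MLP refinement with ReLU
                                 nn.ReLU(), nn.Linear(hidden_dim, dim))
        self.norm_inputs = nn.LayerNorm(dim)
        self.norm_slots = nn.LayerNorm(dim)
        self.norm_mlp = nn.LayerNorm(dim)

    def forward(self, inputs):                                 # inputs: (B, H*W, D) CNN features
        b, n, d = inputs.shape
        inputs = self.norm_inputs(inputs)
        k, v = self.to_k(inputs), self.to_v(inputs)
        slots = self.slots_mu + self.slots_logsigma.exp() * torch.randn(
            b, self.num_slots, d, device=inputs.device)
        for _ in range(self.iters):
            q = self.to_q(self.norm_slots(slots))
            # Eq. (1): softmax over slots, so the slots compete for each input location;
            # eps is the small offset added to the attention coefficients for stability.
            attn = (torch.einsum('bnd,bkd->bnk', k, q) * self.scale).softmax(dim=-1) + self.eps
            # Eq. (2): weighted mean over the inputs, normalised per slot.
            attn = attn / attn.sum(dim=1, keepdim=True)
            updates = torch.einsum('bnk,bnd->bkd', attn, v)
            slots = self.gru(updates.reshape(-1, d), slots.reshape(-1, d)).reshape(b, self.num_slots, d)
            slots = slots + self.mlp(self.norm_mlp(slots))     # residual refinement
        return slots                                           # (B, K, D) object-centric slots
```

The gated (GRU) update and the residual MLP correspond to the gated unit and residual refinement described above; the per-pixel softmax over slots is what makes the slots compete for attentional regions.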
This mechanism constructs a feedback loop to compare similar representations in the answer set and the question set, reducing unnecessary inference interference. For instance, as in Fig. 2, feedback from the answer set reveals a consistent shape pattern, which is then fed back into the forward reasoning process, improving the model's inference efficiency and accuracy.

Bottleneck method

The architecture of traditional neural networks constrains them to focus on individual input attributes while emphasizing the perception of overall input features. This limitation makes it challenging in practice to identify complex natural conceptual objects^37. To address this issue, we adopt a data-driven approach to induce abstract models, introducing the principle of a relational bottleneck. This method encourages the emergence of abstraction mechanisms in the network by restricting information processing to focus primarily on relationships between inputs. A controller decouples the abstract representation of the input from the embedded perceptual information; this separation ensures that the representation in the control path is abstract and relational. It facilitates the learning of rule embeddings at different hierarchical levels and granularities, enabling rapid learning of relational patterns and systematic generalization within the network.

The relational bottleneck is an inductive bias that prioritizes the representation of relationships (such as "same" and "different") over mixed representations of all object features. We define the relational bottleneck as a mechanism that restricts the flow of information from the perceptual system to downstream inference systems, such that the representations passed downstream are composed entirely of relationships. Given inputs representing individual objects, the relational bottleneck constrains the representations passed to downstream inference processes to capture only the relationships between these objects (for example, whether the objects share the same color). Even though the objects depicted in the images may differ, abstract relationships exist between individual attributes such as color and shape. The relational bottleneck method extracts abstract relationships and key attribute features from the known images to infer the unknown image. By abstracting representations of individual attributes and aggregating the abstract representations of multiple attributes, the method decouples the aggregated multi-attribute representation, enabling the inference of unknown shapes.

Consider an information processing system that receives an input signal \(X\) and aims to predict a target signal \(Y\). The input \(X\) is processed to generate a compressed representation \(Z = f(X)\) (the "bottleneck"), which is then used to predict \(Y\). The core idea of information bottleneck theory is "minimal sufficiency": if \(Z\) contains all the information about \(Y\) encoded in \(X\), then it is sufficient for predicting \(Y\). This can be represented as the abstract relational reasoning chain \(X \rightarrow Z \rightarrow Y\), where \(X\) and \(Y\) are sets of the most relevant feature attributes. It can also be characterized as \(I(Z;Y) = I(X;Y)\), where \(I(\cdot \,;\cdot )\) denotes mutual information and \(X\) represents the set of feature attributes of the objects.
$$\begin{aligned} X = ({x_{size}},{x_{colour}},\ldots ,{x_n}) \end{aligned}$$
(3)

For a sufficiently compressed representation \({Z^*}\), we have

$$\begin{aligned} I(X;Z) \le I(X;{Z^*}) \end{aligned}$$
(4)

There is a trade-off between achieving maximum compression and retaining as much relevant information as possible. This trade-off is captured by the information bottleneck objective, obtained by minimizing

$$\begin{aligned} \Psi (Z) = I(X;Z) - \beta I(Z;Y) \end{aligned}$$
(5)

where \(\beta\) controls the trade-off. This objective reflects the tension between preserving the information related to \(Y\), captured by the second term, and discarding information during compression, captured by the first term. The relational bottleneck mechanism learns from the input and compresses it into relational representations that are separate from object features. This enables the model to search within a smaller compressed representation space, a constrained space that retains sufficient attribute features to complete the task while excluding much of the information irrelevant to the input signal \(X\). This promotes effective learning of abstract relationships.

Sequence-to-sequence and algebraic machine reasoning

In the relational bottleneck approach, the control path cannot actively inspect or manipulate the specific content in the perceptual path. Instead, it influences the system's behavior by comparing perceptual representations, specifically identifying system invariances through observed comparisons. To better simulate human reasoning, we propose a reasoning framework highly suited to abstract reasoning, the algebraic machine reasoning framework, which consists of two stages: (1) relational bottleneck algebraic representation and (2) algebraic machine reasoning.

In the first stage, the problem image is divided into nine sub-images. The sub-images are decomposed into different slots using the slot attention mechanism, and the relational bottleneck approach is then employed to represent the key feature information from the different slots. For simple RPM visual question-answering tasks, direct comparisons between sub-images within the same slot are not necessary. Instead, we represent the overall information of the nine sub-images in the question as \({J_{ij}}\), where \({J_{ij}}\) denotes the representation of the sub-image in the \(i\)-th row and \(j\)-th column. Similarly, the representations of the sub-images in the answer set are denoted \({A_{ij}}\).

$$\begin{aligned} \mathbf {J} = [{J_{11}},{J_{12}},\ldots ,{J_{ij}}],\quad 1 \le i,j \le slots*9 \end{aligned}$$
(6)

$$\begin{aligned} \mathbf {A} = [{A_{11}},{A_{12}},\ldots ,{A_{ij}}],\quad 1 \le i,j \le slots*9 \end{aligned}$$
(7)

Inspired by human cognition, we employ the relational bottleneck method to constrain and differentiate feature information, thereby extracting abstract representations from the images and obtaining the sequential features of the matrices through comparisons of these abstract representations. Specifically, taking \({J_{11}}\) as the visual center, we compare \({J_{11}}\) with each \({J_{ij}}\) (including \({J_{11}}\) itself) based on the single-attribute representation obtained through the relational bottleneck method.
If the two representations are identical, we assign a value of 1; otherwise, we assign 0 and record the result at position \({J_{ij}}\). This yields the attribute comparison results between this visual center and the other sub-images, denoted as the relational bottleneck matrix \({G_{11}}\). Selecting another sub-image \({J_{ij}}\) as the visual center, we perform the same attribute comparisons to generate relational bottleneck matrices \({G_{ij}}\) represented solely by 0s and 1s. The matrix representation \(G\), based on this inductive bias, is then obtained by combining the nine relational bottleneck sub-matrices at their corresponding positions.

$$\begin{aligned} G = [{G_{11}},{G_{12}},\ldots ,{G_{ij}}],\quad 1 \le i,j \le slots * 9 \end{aligned}$$
(8)

Similarly, the slot attention mechanism, serving as the initial attribute extraction module, can decompose the sub-images into different slots. For images with complex combinations of sub-image attribute features, comparing whether the slot attribute representations of different sub-images at the same position are identical allows relational bottleneck matrices with richer relational structure to be constructed. However, a more complex relational bottleneck matrix does not necessarily make visual reasoning problems easier to solve; an appropriate number of slots is crucial for solving the various RPM problems.

In the second stage, we reduce the task of "solving RPM tasks" to "finding sequential invariance in the relational bottleneck matrices." The first stage produced relational bottleneck matrices containing only 0s and 1s, obtained through the slot attention mechanism and the relational bottleneck method. In these matrices we observe distinct invariant sequential features. Extending such a periodic pattern to the unknown image's relational sequence allows us to derive the single-attribute relational bottleneck matrix for the unknown image, and aggregating the relational bottleneck matrices of multiple attributes yields the key attribute features of the unknown image. Visual reasoning problems are thus transformed into an algebraic task: examining the invariance of the extracted patterns across multiple sequences (see Fig. 4).

In a high-dimensional relational bottleneck matrix \(G\), finding sequential invariance is relatively easy. Taking row sequences as an example, where \({G_{ij}}\) denotes the digit in the \(i\)-th row and \(j\)-th column of the relational bottleneck matrix, we arrange the matrix in a fixed row and column order and store it in an array, hypothesize that the different sequences follow a cyclic pattern, and verify this hypothesis against the existing sequences to determine sequence invariance. Part of the sequence features of the unknown image are known, and coupling them with the invariant sequence features yields the complete sequence features of the unknown image. Decoupling each relational bottleneck matrix then yields the key attribute features of the unknown image, enabling its full inference.

Figure 4: The color relational bottleneck matrix for the problem set of Fig. 1. This figure shows the relational bottleneck matrix extracted from the problem set of Fig. 1 with respect to color, along with the cyclic sequence features highlighted in different colors.
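As an illustration of the two stages, the following is a minimal sketch of a single-attribute 0-1 relational bottleneck matrix and a row-wise cyclic-invariance check, assuming each of the nine panels has already been reduced to a discrete attribute code (e.g. a colour id) by the perception modules; the helper names (bottleneck_matrix, consistent_with_cyclic_rows) are hypothetical and not the authors' implementation.

```python
# Sketch of the 0-1 relational bottleneck matrix and cyclic sequence-invariance check.
import numpy as np

def bottleneck_matrix(attrs, center):
    """attrs: length-9 list of attribute codes for the 3x3 grid (row-major),
    with the missing panel encoded as None. Returns the 3x3 0-1 matrix G_center."""
    g = np.zeros((3, 3), dtype=int)
    for idx, a in enumerate(attrs):
        if a is not None and a == attrs[center]:
            g[idx // 3, idx % 3] = 1      # 1 if identical to the visual center, else 0
    return g

def consistent_with_cyclic_rows(attrs, candidate):
    """Check whether filling the missing panel with `candidate` makes every row
    a cyclic shift of the first row, i.e. the sequence pattern is invariant."""
    filled = [candidate if a is None else a for a in attrs]
    rows = [filled[0:3], filled[3:6], filled[6:9]]
    shifts = [rows[0][k:] + rows[0][:k] for k in range(3)]   # all cyclic shifts of row 1
    return all(any(row == s for s in shifts) for row in rows[1:])

# Toy usage: colour ids cycle within each row; the bottom-right panel is unknown.
attrs = [0, 1, 2, 1, 2, 0, 2, 0, None]
print(bottleneck_matrix(attrs, center=0))                                  # G_11 for this attribute
print([c for c in range(3) if consistent_with_cyclic_rows(attrs, c)])      # -> [1]
```

In the toy usage, only the candidate attribute value 1 keeps the cyclic row pattern invariant, mirroring how the model extends the observed sequence to the missing panel; repeating the check per attribute and aggregating across attribute-specific bottleneck matrices identifies the answer panel.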
Experiment

Experimental setup

We evaluated the abstract reasoning framework on RAVEN^2 and I-RAVEN^11. These two datasets consist of 7 different RPM configurations, which collectively probe the model's multi-visual-input reasoning capabilities. Each configuration comprises 1000 samples. We divided the dataset into 10 parts, using 6 for training, 2 for validation, and 2 for testing. Images were resized to 80x80, with pixels normalized to the range [0, 1]. Data augmentation, including rotations in multiples of 90 degrees and overall brightness adjustments, was applied during training, as shown in Table 2.

Table 2: Hyperparameters for property prediction.

The forward process first extracts feature information using a CNN and an MLP, with ReLU as the activation function, and normalizes the network's output into a probability distribution over the predicted output categories. The overall loss function is the cross-entropy loss between the relational bottleneck and centroid representations. For the slot attention mechanism, we used K = 9 slots for RAVEN and K = 16 slots for I-RAVEN. The number of slot attention iterations was set to T = 3 for both datasets, with the slot dimension \({D_{slot}}\) = 32. Hyperparameters for the transformer module were: H = 8 heads, L = 6 layers, per-head dimension \({D_{head}}\) = 32, MLP hidden dimension \({D_{MLP}}\) = 512, and a dropout rate of 0.1. All models were implemented in PyTorch and optimized with Adam^38 on Nvidia GPUs. We tested on 2000 instances; running these instances on an 8-core Intel i7 CPU took 13 h.

Analysis of experimental results

We compared our reasoning framework with the weak baselines LSTM^2, WReN^14, and ResNet^14, and the strong baselines ResNet+DRT^14, Wild^11, DCNet^39, SRAN^11, PrAE^10, and STSN^18, all of which are mainstream frameworks for visual reasoning. In both datasets, RPM instances are generated from 7 configurations. We trained the data-driven initial perceptual CNN module on 3500 images (600 images per configuration) from I-RAVEN^18, together with the object-centric slot attention perceptual module. We used neutral data to train the slot attention mechanism and transformer decoder, and fine-tuned them on the main task to improve the accuracy of initial feature attribute extraction. Our model achieved an average accuracy of 96.8% in predicting unknown shapes across the 7 test types, surpassing all baseline models, as shown in Table 3.

Table 3: Performance on I-RAVEN/RAVEN.

Table 4: Inference accuracy for individual attributes on I-RAVEN/RAVEN.

We observed in the experiments that, among the 8 sub-images segmented from each image, selecting the sub-image with the maximum number of objects and matching the corresponding slots (including empty slots) yielded optimal results. Our model shows greater advantages on problems with a larger number of visual inputs, achieving higher accuracy than the baselines on \(2\times 2\) Grid and \(3\times 3\) Grid problems. Due to the nature of the relational bottleneck matrix, more visual inputs produce richer relational connections, making common sequences easier to identify. We also measured performance separately for the different test types, as shown in Table 4.
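For reference, the hyperparameters reported above can be collected into a single configuration object; the following is a minimal sketch assuming a PyTorch-style training setup, with the dataclass and field names chosen for illustration rather than taken from the authors' code.

```python
# Illustrative configuration summarizing the reported experimental setup.
from dataclasses import dataclass

@dataclass
class ReasoningConfig:
    image_size: int = 80          # inputs resized to 80x80, pixels scaled to [0, 1]
    num_slots: int = 16           # K = 16 for I-RAVEN (K = 9 for RAVEN)
    slot_iters: int = 3           # T = 3 slot-attention iterations
    d_slot: int = 32              # slot dimensionality D_slot
    heads: int = 8                # transformer heads H
    layers: int = 6               # transformer layers L
    d_head: int = 32              # per-head dimension D_head
    d_mlp: int = 512              # transformer MLP hidden dimension D_MLP
    dropout: float = 0.1
    optimizer: str = "adam"       # Adam (ref. 38)

cfg = ReasoningConfig()
```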
Ablation studies were also part of our experimental setup. Basic image augmentation and appropriate geometric rotations were performed; since these operations were applied to the data as a whole, they did not introduce new interference. Our data augmentation proved to significantly improve model accuracy: removing it decreased accuracy by around 4%. We also ablated the size of the transformer module and found that a smaller number of transformer layers, L = 4, performed poorly. The optimal number of layers was 6, and this parameter affected the average accuracy by approximately 2%.

A distinctive feature of our model is the object-centric representation. We averaged the input feature vector values of the image and used only the initial CNN and essential feature extraction to extract the visual objects' feature attributes, reducing the extraction of object relations. Removing the slot attention mechanism from the model decreased test accuracy by over 40%, as indicated by "our-CNN" in Table 3. This effect was particularly pronounced in the \(3\times 3\) Grid test, highlighting the crucial role of the object-centric design in extracting relations between multiple input objects. We also removed the position interaction module from the slot attention mechanism, resulting in an overall accuracy decrease of approximately 5%; the model's sensitivity to position dropped, with an 8% accuracy drop in the U-D test in particular. These results demonstrate that the object-centric representation is a central component of our approach.

Figure 5: An RPM instance. In this instance, there are no visual changes in shape, color, or size. It can be inferred that the second figure in each row is the sum of the first and third figures.

Our relational bottleneck method excels at extracting abstract relations from multiple visual inputs. As illustrated in Fig. 5, conventional neural network models struggle to perceive these abstract relations. In this scenario, the second figure can be derived by superimposing (including positions) the first and third figures in each row; likewise, the third figure can be obtained by subtracting the first figure from the second. Our model demonstrates an effective similarity extraction mechanism: through bidirectional reasoning and feedback, it excludes changes in color, shape, and size, focusing solely on abstract representations of quantity and position. By evenly distributing each sub-image into 9 slots and comparing the similarity of attributes between slots, this abstraction mechanism is isolated, and inference over the sequence similarity within the 9 slots allows the complete answer to be deduced.

Conclusions

In this paper, we introduce an abstract reasoning framework that emphasizes object-centric strong inductive biases. This framework combines the object-centric slot attention mechanism with the relational bottleneck method to extract abstract rules for solving complex multi-visual-input reasoning problems. Similarity forms the cornerstone of our reasoning system: by identifying system invariances through comparisons, we establish relational bottleneck matrices to uncover common sequences for inferring the characteristics of unknown shapes. The incorporation of algebraic methods is a distinctive feature of our model, offering a novel solution for visual reasoning.
However, real-world complex images lack clear segmentation boundaries and require customized decomposition of their attribute features. Further exploration is needed to truly address visual reasoning on datasets from the human world. In future work, we aim to delve deeper into simulating human reasoning and to extend the similarity-based and relational bottleneck methods to other visual reasoning datasets. Our enduring goal is to achieve structured abstract reasoning capabilities akin to those of humans.

Data availability

The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.

References

1. Carpenter, P. A., Just, M. A. & Shell, P. What one intelligence test measures: A theoretical account of the processing in the Raven Progressive Matrices test. Psychol. Rev. 97, 404 (1990).
2. Zhang, C., Gao, F. et al. RAVEN: A dataset for relational and analogical visual reasoning. In CVPR, 5317-5327 (2019).
3. Snow, R. E. et al. The topography of ability and learning correlations. Adv. Psychol. Hum. Intell. 2, 103 (1984).
4. Perret, P. Children's inductive reasoning: Developmental and educational perspectives. J. Cogn. Educ. Psychol. 14, 389-408 (2015).
5. Carroll, J. B. Human Cognitive Abilities: A Survey of Factor-Analytic Studies, Vol. 1 (Cambridge University Press, 1993).
6. Jaeggi, S. M. et al. Improving fluid intelligence with training on working memory. Proc. Natl. Acad. Sci. 105, 6829-6833 (2008).
7. Lyu, M., Liu, R. & Wang, J. Solving Raven's Progressive Matrices using RNN reasoning network. In ICCIA, 32-37 (IEEE, 2022).
8. Zhang, C., Jia, B. et al. Learning perceptual inference by contrasting. NIPS 32 (2019).
9. Zombori, Z., Urban, J. & Brown, C. E. Prolog technology reinforcement learning prover (system description). In International Joint Conference on Automated Reasoning, 489-507 (Springer, 2020).
10. Zhang, C., Jia, B. et al. Abstract spatial-temporal reasoning via probabilistic abduction and execution. In CVPR, 9736-9746 (2021).
11. Hu, S., Ma, Y. et al. Stratified rule-aware network for abstract visual reasoning. In AAAI, vol. 35, 1567-1574 (2021).
12. Benny, Y., Pekar, N. & Wolf, L. Scale-localized abstract reasoning. In CVPR, 12557-12565 (2021).
13. Li, Q., Huang, S. et al. Closed loop neural-symbolic learning via integrating neural perception, grammar parsing, and symbolic reasoning. 5884-5894 (PMLR, 2020).
14. Barrett, D., Hill, F. et al. Measuring abstract reasoning in neural networks. In International Conference on Machine Learning, 511-520 (PMLR, 2018).
15. Engelcke, M., Parker Jones, O. & Posner, I. Genesis-V2: Inferring unordered object representations without iterative refinement. In Advances in Neural Information Processing Systems, vol. 34, 8085-8094 (Curran Associates, Inc., 2021).
16. Locatello, F. et al. Object-centric learning with slot attention. Adv. Neural Inf. Process. Syst. 33, 11525-11538 (2020).
17. Ding, D. et al. Attention over learned object embeddings enables complex visual reasoning. NIPS 34, 9112-9124 (2021).
18. Mondal, S. S. et al. Learning to reason over visual objects. ICLR (2023).
19. Veerapaneni, R. et al. Entity abstraction in visual model-based reinforcement learning. In Conference on Robot Learning, 1439-1456 (PMLR, 2020).
20. Kim, J. et al. Shepherding slots to objects: Towards stable and robust object-centric learning. In CVPR, 19198-19207 (2023).
21. Xu, M. et al. End-to-end semi-supervised object detection with soft teacher. In CVPR, 3060-3069 (2021).
22. Elsayed, G. et al. SAVi++: Towards end-to-end object-centric learning from real-world videos. NIPS 35, 28940-28954 (2022).
23. Kabra, R. et al. SIMONe: View-invariant, temporally-abstracted object representations via unsupervised video decomposition. NIPS 34, 20146-20159 (2021).
24. Chen, C., Deng, F. & Ahn, S. ROOTS: Object-centric representation and rendering of 3D scenes. J. Mach. Learn. Res. 22, 1-36 (2021).
25. Tishby, N. & Zaslavsky, N. Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW), 1-5 (IEEE, 2015).
26. Altabaa, A. et al. Abstractors and relational cross-attention: An inductive bias for explicit relational reasoning in transformers. ICLR (2023).
27. McClelland, J. L. Capturing advanced human cognitive abilities with deep neural networks. Trends Cognit. Sci. 26, 1047-1050 (2022).
28. Kerg, G. et al. On neural architecture inductive biases for relational tasks. arXiv:2206.05056 (2022).
29. Webb, T. W., Sinha, I. & Cohen, J. D. Emergent symbols through binding in external memory. arXiv:2012.14601 (2020).
30. Kim, J., Ricci, M. & Serre, T. Not-So-CLEVR: Learning same-different relations strains feedforward neural networks. Interface Focus 8, 20180011 (2018).
31. Ichien, N. et al. Visual analogy: Deep learning versus compositional models. In Proceedings of the 43rd Annual Meeting of the Cognitive Science Society (2021).
32. Huai, T. & Yang, S. Debiased visual question answering via the perspective of question types. Pattern Recognit. Lett. 178, 181-187 (2024).
33. Cho, K. et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, 1724-1734 (Association for Computational Linguistics, Doha, Qatar, 2014).
34. He, K., Zhang, X. & Ren, S. Deep residual learning for image recognition. In CVPR, 770-778 (2016).
35. Zhang, Y. et al. PRN: Progressive reasoning network and its image completion applications. Sci. Rep. 14, 23519 (2024).
36. Li, W. & Sun, J. Visual question answering with attention transfer and a cross-modal gating mechanism. Pattern Recognit. Lett. 133, 334-340 (2020).
37. Chen, Z., De Beuckelaer, A., Wang, X. & Liu, J. Distinct neural substrates of visuospatial and verbal-analytic reasoning as assessed by Raven's Advanced Progressive Matrices. Sci. Rep. 7, 16230 (2017).
38. Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. ICLR (2014).
39. Zhuo, T. & Kankanhalli, M. Effective abstract reasoning with dual-contrast network. ICLR (2022).

Acknowledgements

Our project has received support from the Ministry of Science and Technology's Major Project on Technological Innovation 2030, "Next Generation Artificial Intelligence," project number 2020AAA0109300.

Author information

Author notes
1. Mingyang Zheng and Weibing Wan contributed equally to this work.

Authors and Affiliations
1. School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai, 201620, China: Mingyang Zheng & Weibing Wan
2. School of Computer Science and Technology, Donghua University, Shanghai, 201620, China: Zhijun Fang

Contributions

Mingyang Zheng and Weibing Wan conceived, designed, and conducted the experiments, collected and analyzed data, and drafted and revised the manuscript; Mingyang Zheng and Zhijun Fang performed the experiments and analyzed data. All authors analyzed data, verified experiments, and revised the manuscript.

Corresponding author

Correspondence to Weibing Wan.

Ethics declarations

Competing interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access. This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Cite this article

Zheng, M., Wan, W. & Fang, Z. Abstract visual reasoning based on algebraic methods. Sci Rep 15, 3482 (2025). https://doi.org/10.1038/s41598-025-86804-3

Received: 20 October 2024; Accepted: 14 January 2025; Published: 28 January 2025.
Keywords: Abstract patterns; Inductive biases; End-to-end; Object-centric