Computer Science · 221 papers

Data gaps in Computer Science

261 open data research questions in Computer Science: gaps in available data, datasets, benchmarks, or measurements, extracted from 221 papers in our local library. Below are representative open questions, each linked to the paper that raised it.

Representative open questions

Showing 30 of 261 — one per source paper, highest-quality first.

  • Target discovery and drug design in the era of artificial intelligence (2026) · doi

    Graph neural networks (GNNs) are often trained on small, curated datasets and may not generalize well to larger, more diverse chemical spaces; more comprehensive training datasets are needed to improve GNN performance.

  • AN ITERATIVE GLMM–XGBOOST ALGORITHM WITH GROUP-AWARE CONDITIONAL PERMUTATION IMPORTANCE FOR EXPLAINING MULTILEVEL ITEM RESPONSE DATA (2026) · doi

    Simulation Study 1 evaluates parameter recovery and prediction accuracy across ICC levels (0.00 to 0.25) and sample sizes, but does not test scenarios with extreme ICC values (>0.50) or highly imbalanced cluster sizes, which are common in educational and longitudinal item response studies.

  • Ethical challenges of artificial intelligence in education: A systematic literature review on bias, privacy, and academic integrity (2026) · doi

    Conversational AI chatbots used for student advising and mental health support can propagate prejudices and encourage overreliance on machine recommendations, but the paper does not identify what specific bias detection benchmarks or evaluation datasets should be created to test chatbots across different demographic groups and cultural contexts before educational deployment.

  • A standardized workflow for kinetic metabolic model curation and dissemination (2026) · doi

    The workflow acknowledges that comprehensive and condition-matched kinetic datasets are unavailable for most non-model organisms, but does not provide concrete strategies or benchmarks for model reduction and coarse-graining that preserve predictive accuracy while reducing parameter requirements in sparse-data scenarios.

  • A novel deep learning approach for accurate and efficient design of LNOI power splitters (2026) · doi

    The comparative validation section (3.4) mentions evaluation on literature-based data with results for P1 output only, but the excerpt text cuts off before presenting quantitative accuracy metrics, error distributions, or specific performance comparisons with other inverse design methodologies for LNOI photonic devices.

  • Can deep learning-based segmentation and classification improve the detection of renal cortical abnormalities? (2026) · doi

    The study acknowledges that DMSA scintigraphy underestimates scarring in 30% of kidneys, particularly polar scars which are especially susceptible to misinterpretation. The deep learning-based segmentation and classification models were trained on DMSA images with this inherent underestimation bias, but the paper does not investigate how this label noise affects model performance or propose methods to correct for DMSA underestimation in polar scar detection.

  • COGNITIVE INFRASTRUCTURE AND THE RECURSIVE TRANSFORMATION OF KNOWLEDGE COMMUNICATION: GENERATIVE AI IN SCIENTIFIC PUBLISHING (2026) · doi

    The paper argues that cognitive infrastructure has become a permanent feature of knowledge production and that business models determine technological trajectories through control of training data access and domain-specific model development. However, it does not specify which datasets, training regimes, or model architectures produce competitive advantage in different scholarly domains (STEM vs. humanities vs. social sciences) or how data asymmetries correlate with market consolidation outcomes.

  • Artificial intelligence as a potential empowerment tool for single mothers: opportunities, risks, and structural implications (2026) · doi

    The paper emphasizes that AI design must accurately represent the reality of caregiving, gendered labor, and varied family configurations, but provides no concrete guidelines for how to operationalize or validate such representation in AI training datasets or algorithmic decision-making for single-mother populations.

  • Leveraging machine learning to enhance aerosol classification using Single-Particle Mass Spectrometry (2026) · doi

    Class imbalance remains a fundamental constraint for Soot (0.8% support) and biological particles (Bacteria, Snomax, Agar, Hazelnut), where limited training data prevents model development despite their high atmospheric significance. Targeted approaches must be developed to overcome the scarcity of labeled single-particle mass spectra for these underrepresented but scientifically critical aerosol types.

  • Quantum Information Framework for Neural Network Generalization: A Comprehensive Experimental Analysis (2026) · doi

    The modular arithmetic dataset generation supports only three operations (x+y, x²+y, x³+xy) with a fixed modulus of 97; systematic investigation of how operation complexity, modulus size, and arithmetic structure affect quantum information metrics during neural network training is absent.

  • Management of depression utilizing Traditional Chinese Medicine (2026) · doi

    AI-driven network pharmacology approaches are proposed to predict botanical drug-metabolite-target interactions and inform rational combination strategies for TCM and Western antidepressants, but no specific implementation framework, training datasets, or validation protocols for these predictions in depression treatment are currently detailed.

  • Advances of artificial intelligence applications to low-carbon metallurgy of iron and steel (2026) · doi

    Machine learning applications for defect detection in low-carbon steel WAAM products employ cost-sensitive convolutional neural networks with remanence/magnetooptical imaging, but the validation dataset covers only a limited set of defect types; expansion to diverse surface and subsurface defect morphologies is required.

  • The transformative impact of AI-enabled AlphaFold 3: evolution, current status, and future prospects in structural biology (2026) · doi

    Benchmark sets for statistically correct evaluation of AlphaFold applications require regular updating to prevent overfitting and memorization on standard test sets. Current validation protocols lack agreed-upon, regularly refreshed benchmark datasets for evaluating AlphaFold2 and AlphaFold3 performance across protein classes.

  • Research on a strongly generalizable fault diagnosis method based on adversarial transfer learning (2026) · doi

    The HDAL model demonstrated superior noise robustness compared to DANN and DSAN (90.156% accuracy with noise vs. 84.384% and 86.329%, respectively), but the paper does not specify the noise types, frequency ranges, or signal-to-noise ratios tested. Future work should systematically characterize the types of sensor noise present in nuclear reactor operational transients and evaluate HDAL performance across varying noise conditions relevant to real plant instrumentation.

  • Machine-learning-based reconstruction of Ming-dynasty defensive corridors in Yuxian (2026) · doi

    The study identifies that highly suitable defense corridor areas overlap with high-value zones of kernel density and visibility density, but does not quantify the magnitude of spatial overlap, provide statistical correlation coefficients, or test whether kernel density bandwidth selection and visibility raster resolution significantly affect the overlap assessment.

  • Unveiling digital knowledge pathways for pregnancy-related green food purchase orientations: a comparative fsQCA study of social media discourse in China and Thailand (2026) · doi

    The machine learning topic recognition and fsQCA analysis focused on static social media text from pregnant women; longitudinal tracking of how knowledge configurations and green food consumption orientations evolve during pregnancy trimesters or across multiple pregnancies, and how algorithm-mediated platform exposure shapes knowledge acquisition over time, was not conducted.

  • From unstructured text to structured reasoning: a hybrid knowledge graph for Indonesian sentencing analysis (2026) · doi

    The study evaluated entity extraction on only corruption and narcotics offenses from Indonesian court decisions; applicability to other offense categories (theft, assault, environmental crimes) and cross-jurisdictional legal systems with different statutory structures and epistemological frameworks remains untested.

  • Phishing in the age of distributed intelligence: taxonomies, detection strategies, and the emerging role of federated learning (2026) · doi

    Existing phishing detection datasets have remained largely static for several years, and most are centralized rather than distributed. These datasets do not reflect the non-IID distributional characteristics and data heterogeneity present in real federated learning environments, limiting the ability to properly evaluate federated learning approaches for phishing detection.

  • Neural Network Tools in the Arsenal of a University Teacher (2026) · doi

    The research employed a relatively small sample size, which limited the precision of conclusions about AI-based literature search tools for university teaching. A larger-scale empirical study is needed to identify important trends in how AI tools (Galactica, Semantic Scholar, Lens) compare against traditional bibliographic databases (Web of Science, Scopus, Dimensions) specifically for academic literature retrieval tasks in higher education contexts.

  • Securing Fog-assisted IoT: An Adaptable and Efficient Threat Identification Approach (2026) · doi

    The evaluation uses four existing datasets without specifying which specific IoT attack types (e.g., DDoS variants, zero-day exploits, protocol-specific attacks) are represented in each dataset. The generalizability of the DEL approach to emerging and unknown threat categories in fog-IoT systems remains unvalidated.

  • A Robust Hybrid Deep Learning Model for Multiclass Depression Classification from Speech Audio (2026) · doi

    EEG data are available in the dataset repository but were not utilized; multimodal fusion integrating EEG with audio signals for improved robustness in multiclass depression severity classification remains unexplored and is explicitly positioned as a future research direction.

  • On the interface between linguistics, computer science and psychiatry: analyzing textual key-factors affecting BERT-based classification of schizophrenia in social media texts (2026) · doi

    The r/AskDocs subreddit analysis revealed lexical overfitting with only 18 samples; this requires expansion to larger samples across multiple health-adjacent social media contexts to quantify the degree to which explicit disorder-related vocabulary inflates classification performance and to establish minimum dataset sizes needed to control lexical bias in mental-health NLP models.

  • Cardiovascular Risk Prediction Using Machine Learning: Advances and Clinical Translation (2026) · doi

    Existing datasets used for cardiovascular risk prediction model development lack explicit patient consent for AI applications, but the paper does not specify what robust consent framework structures should be implemented or how secondary data use for machine learning should be transparently communicated to patients regarding algorithmic processing and decision-support applications.

  • Predicting Employee Attrition: A Machine Learning Approach in Human Resource Analytics (2026) · doi

    While the paper notes that Overtime emerges as a major predictor in Gradient Boosting (rank 3) for attrition, reflecting workload effects, it does not empirically measure the threshold at which overtime hours transition from being a retention factor to a significant attrition driver, or examine sector-specific variation in this relationship.

  • GaussianSeal: Rooting Adaptive Watermarks for 3D Gaussian Generation Model (2026) · doi

    While the paper compares GaussianSeal against GaussianMarker and post-generation methods like 3DGS+HiDDeN on specific datasets (Chair, Lego, Hotdog, Mic), comprehensive evaluation across diverse 3D object categories with varying geometric complexity, texture density, and scale properties is not provided. Generalization of watermark capacity and robustness to complex real-world 3D scenes remains unexplored.

  • Cluster Pattern Analysis of Students Stress using Machine Learning Algorithms with Feature Engineering (2026) · doi

    The study uses a limited dataset from a single institution, so its findings cannot be generalized across higher education institutions. A cross-institutional dataset comparing stress clusters and stressor patterns across multiple educational settings is needed to validate whether the identified student stress clusters (high-risk, moderate-risk, and low-risk groups) are consistent across different institutional contexts.

  • Interpretable machine learning-based modelling of minimum miscibility pressure in hydrocarbon gas injection processes (2026) · doi

    The leverage analysis using the Williams plot detected only one suspected data point in the entire dataset, but no investigation was conducted into whether this outlier represents a genuine physical anomaly in MMP behavior or a measurement error that warrants exclusion or separate sub-model development.

  • Scour depth prediction using machine learning and explainable AI: assessment of bridge vulnerability (2026) · doi

    The study compared Gradient Boosting, XGBoost, CatBoost, Random Forest, and ANN-based models on a single scour dataset; cross-validation across multiple independent bridge scour datasets with different geological, hydrodynamic, and pier geometry characteristics is needed to confirm model generalizability for practical bridge vulnerability assessment.

  • A Motion-Based Compression and Tracking System for Video Camera Trap-Based Insect Behaviour Studies (2026) · doi

    The proposed motion-based compression system was evaluated on four datasets (Ratnayake et al. 2020, Voort der van Driessche et al., Navid et al., and Nest Monitoring), but systematic evaluation across camera trap deployments with varying environmental conditions (wind patterns, illumination changes, vegetation density) and different insect taxa beyond Honey bees, Syrphidae, Lepidoptera, and Vespidae has not been conducted. Dataset diversity limitations prevent generalization of compression efficiency and behavioral detection accuracy claims.

  • An Ontology Driven Machine Learning Framework for Early Prediction in Children with Cerebral Palsy (2026) · doi

    The current framework operates with limited input modalities for early prediction in children with cerebral palsy; integration of multimodal neuroimaging data, electromyography signals, or video-based movement analysis alongside clinical observations could address whether ensemble methods and ontological inference rules can exploit richer feature diversity for improved Level 4 precision (currently 0.65 for SVM).

