Foundational Standards and Platforms

Recent developments in AI/ML, especially deep learning, make models more complex and challenging to interpret. Access to, and confidence in, data resources is therefore imperative for reliable energy research. Achieving high-quality data resources requires foundational standards that help researchers maintain consistent, transparent workflows and ensure that models and results can be generalized to field applications. SAMI develops foundational platforms that embed these standards and practices, lowering the barrier for researchers to apply AI/ML in their projects. The institute strives to make its data standards and frameworks compatible with the FAIR data principles to promote the interpretability and reliability of AI/ML models.

SAMI follows the FAIR Data Principles (Findable, Accessible, Interoperable, and Reusable), a set of guiding principles published in Scientific Data by a consortium of scientists and organizations to support the reusability of digital assets. The principles have since been adopted worldwide by research institutions, journals, and data repositories.

  • Wilkinson, M., Dumontier, M., Aalbersberg, I. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3, 160018 (2016). https://doi.org/10.1038/sdata.2016.18
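
The hypothetical record below illustrates how the four FAIR principles can map onto concrete metadata fields attached to a dataset. The field names and values are assumptions for illustration only, not an EDX or SAMI schema.

```python
# A minimal, illustrative metadata record showing how the four FAIR principles
# can map to concrete fields. Field names are hypothetical, not an EDX or SAMI schema.
dataset_record = {
    # Findable: a globally unique, persistent identifier plus rich descriptive metadata
    "identifier": "doi:10.18141/0000000",          # hypothetical DOI
    "title": "Example subsurface core-analysis dataset",
    "keywords": ["carbon storage", "porosity", "well logs"],
    # Accessible: retrievable by its identifier over a standard protocol
    "access_url": "https://edx.netl.doe.gov/dataset/example",  # hypothetical landing page
    "access_protocol": "HTTPS",
    # Interoperable: standard formats and shared vocabularies
    "format": "CSV",
    "vocabulary": "standard geologic unit names",
    # Reusable: clear license and provenance so others can judge fitness for reuse
    "license": "CC-BY-4.0",
    "provenance": "Derived from public well records; processing steps documented in metadata",
}
```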

Determining quality metrics, rating the quality of data, and stratifying data by quality reduce relative uncertainty and build confidence in models constructed with ML. Several custom quality metrics have been established and integrated into the data processing workflow to identify sources of uncertainty in the data. Key focuses for reliable data include the following (a minimal scoring sketch follows the list):

  • Completeness: Amount of relevant data and metadata included in the dataset
  • Accuracy: Degree of confidence in the reported values
  • Usability: Ease of extracting and using the data
  • Standardization: Whether the testing body is accredited and whether recognized standards were followed during testing
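
The sketch below shows one way these focus areas could be combined into a numeric rating and used to stratify records by quality. The weights, thresholds, column names, and input file are illustrative assumptions rather than SAMI's actual scoring workflow.

```python
# A minimal sketch of rating dataset quality and stratifying records by that rating.
# The metric weights, thresholds, and column names are illustrative assumptions.
import pandas as pd

def quality_score(record: pd.Series) -> float:
    """Combine the four focus areas into a single 0-1 score."""
    completeness = record.notna().mean()                 # fraction of fields populated
    accuracy = float(record.get("value_confidence", 0))  # assumed 0-1 confidence field
    usability = 1.0 if record.get("machine_readable", False) else 0.5
    standardization = 1.0 if record.get("accredited_lab", False) else 0.0
    return 0.25 * (completeness + accuracy + usability + standardization)

df = pd.read_csv("measurements.csv")                     # hypothetical input file
df["quality"] = df.apply(quality_score, axis=1)

# Stratify: reserve the highest-quality records for model training,
# lower-quality records for assessment and validation.
train = df[df["quality"] >= 0.8]
holdout = df[df["quality"] < 0.8]
```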

Finally, domain knowledge is integrated into the models at critical validation and verification steps. For example, comparing unsupervised machine learning models with expert-applied labels helps validate results. SAMI supports numerous cross-disciplinary research efforts to help ensure expert knowledge drives modeling and interpretation.
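
As a minimal illustration of that validation step, the sketch below compares unsupervised cluster assignments against expert-applied labels using the adjusted Rand index, one common agreement measure. The data, labels, and choice of clustering algorithm are placeholders, not a SAMI workflow.

```python
# Compare unsupervised cluster assignments against expert-applied labels.
# Placeholder data; adjusted Rand index is one common agreement metric.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

features = np.random.rand(200, 5)                  # placeholder feature matrix
expert_labels = np.random.randint(0, 3, size=200)  # placeholder expert labels

clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)

# Scores near 1 indicate strong agreement with expert judgment; scores near 0
# suggest the clustering is not capturing the structure experts recognize.
print("Agreement with expert labels:", adjusted_rand_score(expert_labels, clusters))
```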

Geospatial and Subsurface Geologic Data Quality

Understanding geospatial and subsurface geologic data quality is essential to the reliability of scientific predictions produced by subsurface modeling and AI/ML applications. Reliable reuse of datasets requires analyzing geospatial and subsurface data quality prior to use, so methodologies that guide both the producer and the consumer of that data are valuable. NETL is developing scalable (desktop to cluster computing) custom deep learning algorithms that leverage natural language processing (NLP) for data and knowledge extraction and for information exploration and classification; these algorithms rely on high-quality documents, data, and accompanying metadata. NETL is also developing a geologic data quality analysis method to assess the quality of large, multi-variable datasets and text-based data resources both before publication and after collection for reuse. The method is based on the metrics of completeness, accuracy, usability, and authority of source, and it helps data users and producers understand the reliability of data applied to carbon storage, basin modeling, geothermal, and other subsurface modeling applications. The method builds on previous geospatial and database quality analysis methods, including those listed below (a simplified text-classification sketch follows the references):

  • Devillers, R., Jeansoulin, R., 2010. Spatial Data Quality: Concepts, in: Fundamentals of Spatial Data Quality. John Wiley & Sons, Ltd, pp. 31–42. https://doi.org/10.1002/9780470612156.ch2
  • Harding, J., 2010. Vector Data Quality: A Data Provider’s Perspective, in: Fundamentals of Spatial Data Quality. John Wiley & Sons, Ltd, pp. 141–159. https://doi.org/10.1002/9780470612156.ch8
  • Pipino, L.L., Lee, Y.W., Wang, R.Y., 2002. Data quality assessment. Communications of the ACM 45, 211–218. https://doi.org/10.1145/505248.506010
  • Veregin, H., 2005. Data Quality Parameters, in: New Developments in Geographical Information Systems: Principles, Techniques, Management and Applications. Wiley, p. 404.
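
The sketch below is a much simpler stand-in for the custom deep learning and NLP pipeline described above: it classifies text-based data resources by topic using TF-IDF features and a linear model from scikit-learn. The corpus, labels, and categories are hypothetical.

```python
# A simplified stand-in for NLP-based document exploration and classification:
# TF-IDF features plus a linear classifier. Corpus and labels are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

documents = [
    "Core porosity and permeability measurements from a saline storage formation",
    "Quarterly budget summary for administrative operations",
    "Well log interpretation and basin temperature gradients",
]
labels = ["subsurface", "non-technical", "subsurface"]   # expert-applied labels

classifier = make_pipeline(TfidfVectorizer(stop_words="english"),
                           LogisticRegression(max_iter=1000))
classifier.fit(documents, labels)

print(classifier.predict(["Reservoir pressure and injection monitoring report"]))
```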

GAIA Computational Facilities

NETL’s Geospatial Analysis, Interpretation and Assessment (GAIA) Computational Facility provides consistent access to research-caliber workstations at all NETL sites to meet geologic and environmental science research and development needs for a range of projects and users. The science-based analyses conducted using GAIA facilities improve our understanding of geologic and environmental systems, expose knowledge and technology gaps, and drive further research. The integrated and collaborative setting of GAIA facilities enables knowledge-sharing across projects and disciplines, improving NETL’s efforts to solve energy issues related to these systems.

Variable Grid Method (VGM) Tool

In 2018, NETL granted a software license for its Variable Grid Method (VGM) Tool, which aids in the analysis and interpretation of spatial data trends, to a Texas-based startup, VariGrid Explorations LLC. NETL’s VGM is a novel approach to data visualization that employs geographic information system capabilities to simultaneously quantify and visualize spatial data trends and underlying data uncertainty. The method provides a user-friendly, flexible, and reliable tool to effectively communicate spatial data, as well as the data’s inherent uncertainties, in a single, unified product.

Applications that use big data, data analytics, and advanced computing run an inherent risk that results could be misleading or contain unseen error and uncertainty. The VGM is a simple, widely accepted tool for communicating that uncertainty or error in spatial-data-driven products. It conveys the relationship between uncertainty and spatial data to effectively guide research, support advanced computational analyses, and inform management and policy decisions, enabling a wide range of end users to reduce risk, improve decision-making, and cut costs.
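
The sketch below illustrates only the underlying idea, not NETL's licensed implementation: grid cells stay coarse where the data are more uncertain and are refined where confidence is higher, so cell size itself communicates uncertainty. The recursion rule, threshold, and synthetic uncertainty surface are assumptions.

```python
# Conceptual sketch of a variable grid: coarsen cells where uncertainty is high,
# refine them where it is low. Not NETL's licensed VGM implementation.
import numpy as np

def variable_grid(uncertainty, x0, y0, size, coarse_threshold=0.4, min_size=1):
    """Recursively refine a square cell until it reaches the minimum size or the
    local mean uncertainty is high enough that a coarse cell is appropriate.
    Returns a list of (x, y, size) cells covering the region."""
    block = uncertainty[y0:y0 + size, x0:x0 + size]
    if size <= min_size or block.mean() >= coarse_threshold:
        return [(x0, y0, size)]      # stop refining: cell is minimal or uncertainty is high
    half = size // 2
    cells = []
    for dx, dy in [(0, 0), (half, 0), (0, half), (half, half)]:
        cells += variable_grid(uncertainty, x0 + dx, y0 + dy, half,
                               coarse_threshold, min_size)
    return cells

# Example: a synthetic 16x16 uncertainty surface (0 = certain, 1 = very uncertain),
# with uncertainty increasing toward the east.
rng = np.random.default_rng(0)
uncertainty = rng.random((16, 16)) * np.linspace(0, 1, 16)
cells = variable_grid(uncertainty, 0, 0, 16)
print(f"{len(cells)} cells; sizes range from {min(c[2] for c in cells)} "
      f"to {max(c[2] for c in cells)}")
```

In the actual VGM product, these variable-sized cells are rendered as a single map so that data values and their uncertainty are communicated together.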

Global Oil & Gas Infrastructure (GOGI) Database

NETL’s Global Oil & Gas Infrastructure (GOGI) database, launched in 2018, provides critical information that helps decision-makers across the globe safeguard public health, safety, and security as stakeholders tackle oil and gas infrastructure development, improvements, and challenges. An NETL-led team of 13 researchers created the first-of-its-kind database, which identifies and provides vital information about more than 6 million individual features, such as wells, pipelines, and ports, from over 1 million data sets in 193 countries and Antarctica. GOGI has been downloaded more than 1,000 times and is used by Harvard University, the Environmental Defense Fund (EDF), the United Nations Environment Programme (UNEP), and others.

  • Rose, K.; Bauer, J.; Baker, V.; Bean, A.; DiGiulio, J.; Jones, K.; Justman, D.; Miller, R. M.; Romeo, L.; Sabbatino, M.; Tong, A. Development of an Open Global Oil and Gas Infrastructure Inventory and Geodatabase; NETL-TRS-6-2018; NETL Technical Report Series; U.S. Department of Energy, National Energy Technology Laboratory: Albany, OR, 2018; p 594; DOI: 10.18141/1427573.

Advanced Alloy Materials Data Quality Metrics and Rating Tool

Domain-specific data quality metrics are key to data analytics efforts in materials science as well. To fill this need, metrics were developed through the eXtremeMAT project to assess the relative quality of alloy composition, processing, and experimental testing data. Understanding the quality of experimental alloy data allows only the highest-quality information to be integrated into analytical models, while low- to medium-quality data can be reserved for model assessment and validation. Stratification of data by quality is critical as data are integrated from numerous resources with varying types and formats. Specific metrics were developed to assign a quality rating for the completeness, usability, accuracy, and standardization of the data, and a data quality rating tool was developed and integrated into EDX so researchers can easily rate a dataset (a simplified rating sketch follows the reference below).

  • Wenzlick, M., Bauer, J.R., Rose, K., Hawk, J. and Devanathan, R., 2020. Data Assessment Method to Support the Development of Creep-Resistant Alloys. Integrating Materials and Manufacturing Innovation, pp.1-14. https://doi.org/10.1007/s40192-020-00167-3
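
The sketch below illustrates the rating idea in simplified form: each record is scored on the four metrics, the weakest metric sets the overall grade, and records are routed to model building or validation accordingly. The column names, grade boundaries, example values, and weakest-link rule are assumptions, not the eXtremeMAT/EDX rating tool itself.

```python
# Simplified illustration of rating alloy records and stratifying them by quality.
# Column names, boundaries, values, and the weakest-link rule are assumptions.
import pandas as pd

GRADES = ["low", "medium", "high"]

def grade(score: float) -> str:
    """Map a 0-1 metric score to a categorical grade."""
    return "high" if score >= 0.8 else "medium" if score >= 0.5 else "low"

def overall_rating(row: pd.Series) -> str:
    """Overall rating is limited by the weakest individual metric."""
    metrics = [row["completeness"], row["usability"], row["accuracy"], row["standardization"]]
    return min((grade(m) for m in metrics), key=GRADES.index)

alloys = pd.DataFrame({
    "alloy": ["A617", "P91", "HR6W"],          # illustrative example records
    "completeness": [0.95, 0.70, 0.40],
    "usability": [0.90, 0.85, 0.60],
    "accuracy": [0.85, 0.55, 0.50],
    "standardization": [1.00, 0.60, 0.30],
})
alloys["rating"] = alloys.apply(overall_rating, axis=1)

model_data = alloys[alloys["rating"] == "high"]       # feed analytical models
validation_data = alloys[alloys["rating"] != "high"]  # reserve for assessment/validation
print(alloys[["alloy", "rating"]])
```

A weakest-link rule like this is a conservative choice: one poorly documented metric is enough to keep a record out of model training.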

Understanding Data Processing and Analysis Metadata Attributes

Interpretability of data analytics was further explored through the eXtremeMAT project. The metadata attributes needed to understand data processing and the model-building workflow were investigated by comparing previous analyses with updated work on expanded datasets (a simplified metadata-capture sketch follows the reference below).

  • Wenzlick, M., M.G. Mamun, R. Devanathan, K. Rose, J.A. Hawk, “Data science techniques, assumptions, and challenges in alloy clustering and property prediction.” Journal of Materials Engineering and Performance 30, 823–838 (2021). https://doi.org/10.1007/s11665-020-05340-5
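
The sketch below shows the kind of processing and model-building metadata such a study suggests capturing so that an analysis can be re-examined later. The attribute names and values are illustrative assumptions, not the attribute set identified in the cited work.

```python
# Illustrative capture of data-processing and model-building metadata for a single
# analysis run. All attribute names and values are hypothetical.
import json
from datetime import datetime, timezone

analysis_metadata = {
    "dataset": {
        "source": "compiled alloy creep records",     # hypothetical source description
        "version": "2021-03",
        "n_records_raw": 1834,
        "n_records_after_cleaning": 1502,
    },
    "processing": [
        {"step": "drop records missing rupture life", "records_removed": 212},
        {"step": "standardize composition units to wt%", "records_changed": 640},
        {"step": "remove duplicate heats", "records_removed": 120},
    ],
    "model": {
        "algorithm": "k-means clustering",
        "n_clusters": 5,
        "features": ["Cr", "Ni", "Mo", "test_temperature", "applied_stress"],
        "random_seed": 42,
    },
    "run": {"timestamp": datetime.now(timezone.utc).isoformat(), "software": "scikit-learn 1.x"},
}

with open("analysis_metadata.json", "w") as f:
    json.dump(analysis_metadata, f, indent=2)
```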

Multiphase Flow Science Optimization

Nodeworks is a flexible library for graphical programming of process workflows. It allows GUI developers to interface with other standalone GUIs and to create customized algorithms, called nodes, that can be chained into specialized workflows. For more information, please visit https://mfix.netl.doe.gov/nodeworks/.
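
The sketch below illustrates the node-and-workflow concept in plain Python; it is a generic illustration of the idea, not Nodeworks' actual API.

```python
# Generic illustration of a node-based workflow: each node wraps one step,
# and the workflow runs the nodes in order, passing results along.
from typing import Any, Callable

class Node:
    def __init__(self, name: str, func: Callable[[Any], Any]):
        self.name = name
        self.func = func

class Workflow:
    def __init__(self, nodes: list[Node]):
        self.nodes = nodes

    def run(self, data: Any) -> Any:
        for node in self.nodes:
            data = node.func(data)
            print(f"{node.name}: {data}")
        return data

# Example: a toy three-node workflow (load -> scale -> summarize)
pipeline = Workflow([
    Node("load", lambda _: [1.0, 2.0, 3.0, 4.0]),
    Node("scale", lambda xs: [x * 10 for x in xs]),
    Node("summarize", lambda xs: sum(xs) / len(xs)),
])
pipeline.run(None)
```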