Journal:Geochemical biodegraded oil classification using a machine learning approach

Full article title	Geochemical biodegraded oil classification using a machine learning approach
Journal	Geosciences
Author(s)	Bispo-Silva, Sizenando; Ferreira de Oliveira, Cleverson J.; de Alemar Barberes, Gabriel
Author affiliation(s)	Centro de Pesquisas Leopoldo Américo Miguez de Mello, University of Coimbra
Primary contact	Email: sizenando at petrobras dot com dot br
Editors	Malvić, Tomislav; Martinez-Frias, Jesus
Year published	2023
Volume and issue	13(11)
Article #	321
DOI	10.3390/geosciences13110321
ISSN	2076-3263
Distribution license	Creative Commons Attribution 4.0 International
Website	https://www.mdpi.com/2076-3263/13/11/321
Download	https://www.mdpi.com/2076-3263/13/11/321/pdf?version=1698160370 (PDF)

This article should be considered a work in progress and incomplete. Consider this article incomplete until this notice is removed.

Abstract

Chromatographic oil analysis is an important step for the identification of biodegraded petroleum via peak visualization and interpretation of phenomena that explain the oil geochemistry. However, analyses of chromatogram components by geochemists are comparative, visual, and consequently slow. This article aims to improve the chromatogram analysis process performed during geochemical interpretation by proposing the use of convolutional neural networks (CNN), which are deep learning techniques widely used by big tech companies. Two hundred and twenty-one (221) chromatographic oil images from different worldwide basins (Brazil, USA, Portugal, Angola, and Venezuela) were used. The open-source software Orange Data Mining was used to process images by CNN. The CNN algorithm extracts, pixel by pixel, recurring features from the images through convolutional operations. Subsequently, the recurring features are grouped into common feature groups. The training result obtained a classification accuracy (CA) of 96.7% and an area under the receiver operating characteristic (ROC) curve (AUC) of 99.7%. In turn, the test result obtained a 97.6% CA and a 99.7% AUC. This work suggests that the processing of petroleum chromatographic images through CNN can become a new tool for the study of petroleum geochemistry since the chromatograms can be loaded, read, grouped, and classified more efficiently and quickly than the evaluations applied in classical methods.

Keywords: convolutional neural network, biodegradation, organic geochemistry, Orange Data Mining, chromatogram image

Introduction

The gas chromatography (GC) technique is widely used by the oil industry and can answer questions related to the origin of the oil and the physical and chemical conditions of production, refining, and storage. [1] Recently, the emergence of artificial intelligence (AI) techniques has opened up new ways to process, group, and classify complex image data, which by extension could also be applied to classify chromatogram components. [2]

Image data are part of the analytical routine practiced by petroleum geochemists, who use the proportion among chromatographic peaks to define the precursor geological environment and identify contamination by drilling fluid, light exhaust, mixing of oils, and even biodegradation. [3,4,5]

It is important to create a routine for labeling geochemical data in a way that facilitates its extraction and transformation into information to support companies’ decision-making. The most modern way to reach this level of management is through machine learning (ML) techniques controlled by experts in the field. In the case study presented in this paper, users can quickly decide whether their analysis requires extracting biodegraded oils from the data. They can then download data efficiently with a low risk of noise, which enables them to obtain more accurate information.

Oil biodegradation is a phenomenon caused by bacterial activity below 80 °C, often found in shallow reservoir conditions close to the water/oil contact. [6,7] These bacteria tend to consume the oil’s light compounds in the saturate fraction first (preferentially n-alkanes and then isoalkanes) and then the aromatics. The more resistant compounds form complex chemical structures that appear in the chromatogram as a baseline hump called the unresolved complex mixture (UCM). [1] As biodegradation begins, the UCM tends to rise, whereas the concentration of n-alkanes decreases. These observations allowed Wenger [6] to build a scale that ranks the extent of biodegradation in five levels: very slight, slight, moderate, heavy, and severe (Figure 1). The biodegrading bacteria begin by consuming the C₈–C₁₅ alkanes, accompanied by a very slight rise in the UCM. Next, at a moderate level, bacteria consume most of the n-alkanes (nC₁₅₊); however, the UCM still shows only a tenuous hump.

Figure 1. Biodegradation Stages. Based on Wenger. [8]

Petroleum density is vital to the oil and gas industry because it affects both the cost of reservoir recovery and the quality of refined products, and therefore companies’ production costs. [5,8] °API gravity, and with it petroleum quality, decreases as the light compounds are lost. [3,6,9] This effect is more pronounced between the slight and moderate biodegradation levels than between the moderate and severe levels. [9]

Pristane and phytane are two isoalkanes commonly found in petroleum; in chromatograms they appear next to nC₁₇ and nC₁₈, respectively. The ratio between the chromatographic peaks of these compounds indicates the probable degree of biodegradation. At the moderate level, the pristane/phytane ratio is little changed. At the heavy level, the UCM hump is very prominent, and n-alkanes become rare. [3,6,10] When biodegradation reaches the severe stage, biomarkers begin to be consumed, and demethylated hopanes (25-norhopanes) are formed as a result of the ring-opening process by bacteria. [10] If the reservoir received more than one oil charge, the presence of 25-norhopanes together with n-alkanes suggests a mixture of oil pulses. [3,6,8,11]
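As a rough illustration of how such peak ratios can be read from a chromatogram, the sketch below computes three ratios from peak areas. The choice of ratios (Pr/nC₁₇, Ph/nC₁₈, and Pr/Ph) is an assumption, since the text does not name the exact ratios used, and the function name `peak_ratios` and all peak areas are hypothetical values invented for this example.

```python
def peak_ratios(peaks):
    """Return isoprenoid/n-alkane and pristane/phytane ratios from peak areas."""
    return {
        "Pr/nC17": peaks["pristane"] / peaks["nC17"],
        "Ph/nC18": peaks["phytane"] / peaks["nC18"],
        "Pr/Ph": peaks["pristane"] / peaks["phytane"],
    }


# Hypothetical peak areas in arbitrary units (invented, not measured data).
fresh_oil = {"nC17": 100.0, "pristane": 60.0, "nC18": 90.0, "phytane": 30.0}
degraded_oil = {"nC17": 20.0, "pristane": 57.0, "nC18": 15.0, "phytane": 30.0}

fresh_ratios = peak_ratios(fresh_oil)
degraded_ratios = peak_ratios(degraded_oil)

for name in fresh_ratios:
    print(f"{name}: fresh={fresh_ratios[name]:.2f} degraded={degraded_ratios[name]:.2f}")
```

With these invented numbers, the isoalkane to n-alkane ratios rise as the n-alkanes are removed, while Pr/Ph barely moves, which matches the behavior described above for moderate biodegradation.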

In geochemical studies of petroleum, it is common to analyze many samples or compare a few samples with previous analyses to group them, classify the characteristics of the oil, and propose a diagnosis of the studied area (e.g., well, reservoir, basin, etc.). As a result, accurately evaluating each chromatogram image can take the geochemist a very long time, owing to the large number of analyses or the complexity of the samples. The use of AI in such geochemical analysis can bring cost and time savings and reduce the possibility of interpretation errors. Despite this promise, topics related specifically to the organic geochemistry of petroleum involving the use of AI in image processing are still rare.

The use of statistics in petroleum geochemistry began around the 1960s, with simpler regression techniques and bivariate data. Subsequently, multivariate techniques with chemometrics and ML began to be used more widely because of the spread of computers and the increase in computational capacity. [12]

Chemometrics aims to explain chemical phenomena through statistical methods, which, in turn, can be processed quickly on a computer by AI algorithms (using ML and deep learning techniques). A milestone in the use of AI in petroleum geochemistry is the work of McCammon [13], who used cluster separation (dendrograms) of oil constituents to determine which of three producing horizons (in fields in California) would preferentially drain. Wang et al. [14] conducted an extensive review of the use of chemometric and ML methods in petroleum geochemistry, introducing the possibility of using concentration data in certain situations.

One of the main deep learning algorithms for image classification is the convolutional neural network (CNN), which maps images by finding recurring features and classifying them through neural networks. CNNs, which process and classify image files, have been developed since the 1980s but gained popularity in 2012 [15,16], when they aroused the interest of big tech. The method caught the attention of the scientific community at the International Skin Imaging Collaboration of 2017, when it was used to classify images of melanomas with precision similar to that of experienced dermatologists, bringing speed to the diagnosis of this disease. [17,18] A CNN is trained on a large amount of categorized image data (e.g., topographies such as hill, valley, and mountain) that are read pixel by pixel and transformed into a vector of scores, one for each category. The goal is for the correct category to receive the highest score, which means reducing the error between the output vector and the desired vector. To reduce this error, the algorithm adjusts “weights” (millions of adjustable parameters) that control the input and output of the network, and it computes a gradient vector that indicates how much a slight change in each weight would increase or decrease the error. This is done through stochastic gradient descent (SGD), a technique that repeatedly presents input vectors, calculates the outputs and their respective errors, and readjusts the weights with each new measurement. A weighted sum of the feature vector components is computed, and when it is above a certain threshold, the input is classified as belonging to a particular category. [15,19,20]
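The score vector and weight update described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' Orange Data Mining workflow and not a full CNN: it trains a single-layer softmax classifier (no convolutional layers) with SGD on four hypothetical four-pixel "images". All function names, the learning rate, the epoch count, and the data are invented for the example.

```python
import math
import random


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_scores(weights, pixels):
    """One score per category: weighted sum of pixel values plus a bias."""
    return [sum(w * x for w, x in zip(row[:-1], pixels)) + row[-1] for row in weights]


def sgd_step(weights, pixels, label, lr):
    """Update weights from one example; return its error before the update."""
    probs = softmax(predict_scores(weights, pixels))
    for k, row in enumerate(weights):
        # Gradient of the cross-entropy error with respect to score k.
        grad = probs[k] - (1.0 if k == label else 0.0)
        for j, x in enumerate(pixels):
            row[j] -= lr * grad * x
        row[-1] -= lr * grad  # bias term
    return -math.log(probs[label])


def train(samples, n_classes, epochs=200, lr=0.5, seed=0):
    """Run SGD over shuffled samples; return weights and mean error per epoch."""
    rng = random.Random(seed)
    n_pixels = len(samples[0][0])
    weights = [[0.0] * (n_pixels + 1) for _ in range(n_classes)]
    history = []
    for _ in range(epochs):
        order = samples[:]
        rng.shuffle(order)
        errors = [sgd_step(weights, px, y, lr) for px, y in order]
        history.append(sum(errors) / len(errors))
    return weights, history


def classify(weights, pixels):
    """Pick the category with the highest score."""
    scores = predict_scores(weights, pixels)
    return scores.index(max(scores))


# Hypothetical data: class 0 is bright on the left, class 1 on the right.
SAMPLES = [
    ([0.9, 0.8, 0.1, 0.0], 0),
    ([1.0, 0.7, 0.2, 0.1], 0),
    ([0.1, 0.2, 0.9, 1.0], 1),
    ([0.0, 0.1, 0.8, 0.9], 1),
]

weights, history = train(SAMPLES, n_classes=2)
predictions = [classify(weights, px) for px, _ in SAMPLES]
print("first/last mean error:", round(history[0], 3), round(history[-1], 3))
print("predictions:", predictions)
```

In a real CNN the same update rule is applied to the weights of the convolutional filters as well as the final classifier, with the gradients obtained by backpropagation.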

Studies using the same or similar algorithms began to be published in other areas of knowledge. In geology, de Lima et al. [21] used images of fossils, rock samples, cores, and petrographic samples to classify and group them, with satisfactory results. Other authors were also able to classify rock images in order to improve petrographic analysis time through ternary diagrams. [22,23] CNN has been used to classify explosive volcanic plumes [24], identify fossils [25], and cluster unstructured geological text data. [26] Koeshidayatullah et al. [27] used transfer learning [28] to classify 4,000 carbonate petrographic images into six classes, as well as nine object detection classes. Pires de Lima et al. [28] also used transfer learning to make lithofacies classifications with approximately 7,000 images split into 17 classes. These authors also compared different pre-trained models to accurately classify petrographic thin-section images. [29] CNN was successfully used to identify rock fractures from outcrop pictures and drills. [30,31] Kim et al. [32] applied CNN to identify saturation changes in core images caused by gas hydrate dissociation. With regard to source rock, CNN coupled with an unsupervised algorithm was applied to well logging data to predict total organic carbon (TOC), S₂, and S₁ values [33,34] and to seismic images to identify petroleum system elements and, consequently, hydrocarbon leads. [35] In addition, some papers used semantic segmentation to identify coal macerals and determine their rank. [36,37] According to some authors, CNN can be used to predict rock porosity from logging data and seismic images [38,39], as well as permeability. [40] Zeng and Wang [41] were able to use CNN to classify synthetic-aperture radar (SAR) images of oil spills with greater accuracy than conventional ML methods. Moreover, some authors have used CNN to classify remote-sensing image scenes. [42,43,44]

In the forensic area, Bogdal et al. [45] used chromatogram image data to classify flammable waste and determine the presence of traces of gasoline. Furthermore, in the field of organic chemistry, some works used CNN to qualify peaks affected by elution in gas chromatography–mass spectrometry (GC-MS) chromatograms in order to discriminate noise from true peaks. [46]

This article reports the automation of an image analysis process for discriminating biodegraded oils from non-biodegraded oils. Beyond speeding up the analysis process, the success of this test offers a new perspective on the geochemical characterization of oils.

Materials and methods

Convolutional neural network (CNN)

References

Notes

This presentation is faithful to the original, with only a few minor changes to presentation, spelling, and grammar. In some cases important information was missing from the references, and that information was added. The footnote at the end of the original version was turned into a formal citation for this version.