Journal:Geochemical biodegraded oil classification using a machine learning approach

Full article title	Geochemical biodegraded oil classification using a machine learning approach
Journal	Geosciences
Author(s)	Bispo-Silva, Sizenando; Ferreira de Oliveira, Cleverson J.; de Alemar Barberes, Gabriel
Author affiliation(s)	Centro de Pesquisas Leopoldo Américo Miguez de Mello, University of Coimbra
Primary contact	Email: sizenando at petrobras dot com dot br
Editors	Malvić, Tomislav; Martinez-Frias, Jesus
Year published	2023
Volume and issue	13(11)
Article #	321
DOI	10.3390/geosciences13110321
ISSN	2076-3263
Distribution license	Creative Commons Attribution 4.0 International
Website	https://www.mdpi.com/2076-3263/13/11/321
Download	https://www.mdpi.com/2076-3263/13/11/321/pdf?version=1698160370 (PDF)

This article should be considered a work in progress and incomplete. Consider this article incomplete until this notice is removed.

Abstract

Chromatographic oil analysis is an important step for the identification of biodegraded petroleum via peak visualization and interpretation of phenomena that explain the oil geochemistry. However, analyses of chromatogram components by geochemists are comparative, visual, and consequently slow. This article aims to improve the chromatogram analysis process performed during geochemical interpretation by proposing the use of convolutional neural networks (CNN), which are deep learning techniques widely used by big tech companies. Two hundred and twenty-one (221) chromatographic oil images from different worldwide basins (Brazil, USA, Portugal, Angola, and Venezuela) were used. The open-source software Orange Data Mining was used to process images by CNN. The CNN algorithm extracts, pixel by pixel, recurring features from the images through convolutional operations. Subsequently, the recurring features are grouped into common feature groups. The training result obtained a classification accuracy (CA) of 96.7% and an area under the receiver operating characteristic (ROC) curve (AUC) of 99.7%. In turn, the test result obtained a 97.6% CA and a 99.7% AUC. This work suggests that the processing of petroleum chromatographic images through CNN can become a new tool for the study of petroleum geochemistry since the chromatograms can be loaded, read, grouped, and classified more efficiently and quickly than the evaluations applied in classical methods.

Keywords: convolutional neural network, biodegradation, organic geochemistry, Orange Data Mining, chromatogram image

Introduction

The gas chromatography (GC) technique is widely used by the oil industry and can answer questions related to the origin of the oil and the physical and chemical conditions of production, refining, and storage. [1] Recently, the emergence of artificial intelligence (AI) techniques has opened up the processing, grouping, and classification of complex image data, an approach that by extension could also be applied to classify chromatogram components. [2]

Image data are part of the analytical routine practiced by petroleum geochemists, who use the proportion among chromatographic peaks to define the precursor geological environment and identify contamination by drilling fluid, light exhaust, mixing of oils, and even biodegradation. [3,4,5]

It is important to create a routine for labeling geochemical data in a way that facilitates its extraction and transformation into information to support companies’ decision-making. The most modern way to reach this level of management is through machine learning (ML) techniques controlled by experts in the field. In the case study of this paper, the users will quickly decide whether, in their analysis, they need to extract biodegraded oils from the data. Hence, the users will be able to download data efficiently with a low risk of noise, which will enable them to obtain more accurate information.

Oil biodegradation is a phenomenon caused by bacterial activity below 80 °C, often found in shallow reservoir conditions close to the water/oil contact. [6,7] These bacteria tend to consume the oil’s light compounds in the saturate fraction (preferentially n-alkanes and then isoalkanes) and then consume aromatics. Resistant compounds that form complex chemical structures remain; they make up the hump on the chromatographic baseline called the unresolved complex mixture (UCM). [1] As biodegradation proceeds, the UCM tends to rise, whereas the concentration of n-alkanes decreases. These observations allowed Wenger [6] to build a scale that ranks the extent of biodegradation in five levels: very slight, slight, moderate, heavy, and severe (Figure 1). The biodegrading bacteria first consume the C₈–C₁₅ alkanes, accompanied by a very slight rise in the UCM. Next, at a moderate level, the bacteria consume most of the n-alkanes (nC₁₅₊); however, the UCM still presents only a tenuous hump.

Figure 1. Biodegradation Stages. Based on Wenger. [8]

Petroleum density is vital to the oil and gas industry because it affects both the cost of reservoir recovery and the quality of refined products, and therefore companies’ production costs. [5,8] The °API gravity, and with it petroleum quality, decreases as light compounds are lost. [3,6,9] The °API gravity is more sensitive at a slight to moderate biodegradation level than at a moderate to severe biodegradation level. [9]

Pristane and phytane are two isoalkanes commonly found in petroleum and represented in petroleum chromatograms next to nC₁₇ and nC₁₈, respectively. The ratio between the chromatographic peaks of these compounds indicates the probable degree of biodegradation. At the level of moderate biodegradation, the pristane/phytane ratio is little changed. At a heavy level, the UCM hump is very prominent, and n-alkanes become rare. [3,6,10] When the biodegradation reaches the severe stages, biomarkers begin to be consumed, and the demethylated hopanes (25-norhopane) are formed as a result of the ring-opening process by bacteria. [10] If the reservoir received more than one oil charge, the presence of 25-norhopane together with n-alkanes suggests a mixture of oil pulses. [3,6,8,11]

In geochemical studies of petroleum, it is common to analyze many samples or compare a few samples with previous analyses to group them, classify the characteristics of the oil, and propose a diagnosis of the studied area (e.g., well, reservoir, basin, etc.). So, in essence, the accurate evaluation of each chromatogram image can take a very long time for the geochemist due to the large number of analyses or the complexity of the samples. However, the use of AI in such geochemical analysis brings cost and time savings and reduces the possibility of interpretation errors. Despite this promise, topics related specifically to the organic geochemistry of petroleum involving the use of AI in image processing are still rare.

The use of statistics in petroleum geochemistry began around the 1960s, with simpler regression techniques and bivariate data. Subsequently, multivariate techniques with chemometrics and ML began to be used more widely because of the spread of computers and the increase in computational capacity. [12]

Chemometrics aims to explain chemical phenomena through statistical methods, which, in turn, can be processed quickly on a computer by AI algorithms (using ML and deep learning techniques). A milestone in the use of AI in petroleum geochemistry is the work of McCammon [13], who used cluster separation (dendrograms) of oil constituents to unravel which of three producing horizons (in fields in California) would preferentially drain. Wang et al. [14] carried out an extensive review of the use of chemometric and ML methods in petroleum geochemistry, introducing the possibility of using concentration data in certain situations.

One of the main deep learning algorithms for image classification is the convolutional neural network (CNN), which maps images, finds recurring features in them, and classifies them through neural networks. CNNs, which process and classify image files, have been developed since the 1980s, but they gained popularity in 2012 [15,16] when they aroused the interest of big tech. The method caught the attention of the scientific community at the International Skin Imaging Collaboration of 2017, when the technique was used to classify images of melanomas with precision similar to that of experienced dermatologists, bringing speed to the diagnosis of this disease. [17,18]

CNN uses a large amount of categorized image data (e.g., topographies such as hill, valley, and mountain) that are read pixel by pixel and transformed into a vector of scores, one for each category. The goal of the algorithm is that the correct category has the highest score, which means reducing the error between the output vector and the standard (desired) vector. To reduce the error, the algorithm adjusts “weights” (millions of adjustable parameters) that control the input and output of the network, and it computes a vector that indicates how much a slight change in each weight would increase or decrease the error. This is possible because of stochastic gradient descent (SGD), a technique that repeatedly presents input vectors, calculates the outputs and their respective errors, and readjusts the weights with each new measurement. A weighted sum of the feature vector is then computed, and when it is above a certain threshold, the input is classified as belonging to a category. [15,19,20]
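The training loop described above can be sketched in a few lines. The following is a minimal illustration, not the network used in this study: it trains a single-layer softmax classifier by SGD on made-up two-feature data. The toy data, the learning rate, the number of epochs, and all function names are assumptions chosen only for illustration.

```python
import math
import random

def softmax(scores):
    """Turn a vector of raw scores into probabilities, one per category."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(weights, x):
    """A weighted sum of the inputs (plus a bias) gives one score per category."""
    scores = [sum(w_j * x_j for w_j, x_j in zip(w[:-1], x)) + w[-1] for w in weights]
    return softmax(scores)

def train_sgd(samples, n_classes, lr=0.1, epochs=200, seed=0):
    """Present one sample at a time, measure the error, and nudge every weight."""
    rng = random.Random(seed)
    n_features = len(samples[0][0])
    weights = [[0.0] * (n_features + 1) for _ in range(n_classes)]
    samples = list(samples)  # work on a copy so the caller's order is kept
    for _ in range(epochs):
        rng.shuffle(samples)
        for x, label in samples:
            probs = predict(weights, x)
            for k in range(n_classes):
                # Gradient of the cross-entropy error with respect to score k.
                error = probs[k] - (1.0 if k == label else 0.0)
                for j in range(n_features):
                    weights[k][j] -= lr * error * x[j]
                weights[k][-1] -= lr * error
    return weights

# Hypothetical toy data: (feature vector, category index).
toy = [([0.0, 0.1], 0), ([0.2, 0.0], 0), ([0.1, 0.3], 0),
       ([1.0, 0.9], 1), ([0.8, 1.0], 1), ([0.9, 0.7], 1)]
weights = train_sgd(toy, n_classes=2)
predicted = [max(range(2), key=lambda k: predict(weights, x)[k]) for x, _ in toy]
```

After training, the category with the highest score matches the label of every toy sample. A real CNN applies the same update rule to millions of weights spread over many layers.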

Studies using the same or similar algorithms have also been published in other areas of knowledge. In geology, de Lima et al. [21] used images of fossils, rock samples, cores, and petrographic samples to classify and group them, and obtained satisfactory results. Other authors were also able to classify rock images in order to improve petrographic analysis time through ternary diagrams. [22,23] CNN has been used for classifying explosive volcanic plumes [24], identifying fossils [25], and clustering unstructured geological text data. [26] Koeshidayatullah et al. [27] used transfer learning [28] to classify 4,000 carbonate petrographic images in six classes, as well as nine object detection classes. Pires de Lima et al. [28] also used transfer learning to make lithofacies classifications with approximately 7,000 images split into 17 classes. These authors also compared different pre-trained models to accurately classify petrographic thin-section images. [29] CNN was successfully used to identify rock fractures from outcrop pictures and drills. [30,31] Kim et al. [32] applied CNN to identify saturation changes in core images caused by gas hydrate dissociation. With regard to source rock, CNN coupled with an unsupervised algorithm was used on well-logging data to predict total organic carbon (TOC), S₂, and S₁ values [33,34] and was used on seismic images to identify petroleum system elements and, consequently, hydrocarbon leads. [35] In addition, some papers used semantic segmentation to identify coal macerals and determine their rank. [36,37] According to some authors, CNN can be used to predict rock porosity from well-logging data and seismic images [38,39], as well as permeability. [40] Zeng and Wang [41] were able to use CNN to classify SAR images of oil spills with greater accuracy than conventional ML methods. Moreover, some authors have used CNN to classify remote-sensing image scenes. [42,43,44]

In the forensic area, Bogdal et al. [45] used chromatogram image data to classify flammable waste and determine the presence of traces of gasoline. Furthermore, in the field of organic chemistry, some works used CNN to qualify peaks affected by elution on gas chromatography–mass spectrometry (GC-MS) chromatograms in order to discriminate noise from true peaks. [46]

This article reports the automation of image analysis for the purpose of discriminating biodegraded oils from non-biodegraded oils. The success of this test, in addition to speeding up the analysis process, brings a new perspective to the geochemical characterization of oils.

Materials and methods

Convolutional neural network (CNN)

The first step in using CNN was to group the image bank according to categories (continuing the example given above: hill, valley, or mountain) and load it into the algorithm. Subsequently, the data go through a set of convolutional layers that work as an extractor of recurring features from the images, arranging them in feature maps (Figure 2). Each neuron in the feature map of a given layer is connected to local patches of the previous layer via a set of weights (a filter bank). LeCun et al. [15] state that all units in a feature map share the same filter bank, which mathematically corresponds to a convolution. To obtain more robust features that can recognize patterns at any position in the image, a nonlinear (kernel) calculation method is then applied. This step is called a "pooling layer" and is responsible for reducing the variance in feature maps caused by distortions or translations (Figure 2). According to LeCun et al. [15], “although the role of the convolutional layer is to detect local conjunctions of features from the previous layer, the role of the pooling layer is to merge semantically similar features into one.”
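The convolution and pooling steps can be illustrated with a small, self-contained sketch. The 5×5 image patch, the 2×2 filter, and the function names below are hypothetical and chosen only to show the mechanics; a real CNN such as InceptionV3 learns many filters and stacks many layers.

```python
def convolve2d(image, kernel):
    """Slide one shared filter over the image (valid positions only).

    As in most deep learning libraries, the kernel is not flipped, so this
    is strictly a cross-correlation.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

def relu(feature_map):
    """Nonlinearity: keep positive responses, zero out the rest."""
    return [[max(0, v) for v in row] for row in feature_map]

def max_pool(feature_map, size=2):
    """Keep the largest response in each non-overlapping size x size block."""
    rows = range(0, len(feature_map) - size + 1, size)
    cols = range(0, len(feature_map[0]) - size + 1, size)
    return [[max(feature_map[r + i][c + j]
                 for i in range(size) for j in range(size))
             for c in cols]
            for r in rows]

# Hypothetical 5x5 grayscale patch with a vertical edge (dark left, bright right).
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# The same edge shifted one pixel to the left.
shifted = [[0, 1, 1, 1, 1] for _ in range(5)]
# Hypothetical 2x2 filter that responds to a dark-to-bright step.
kernel = [[-1, 1],
          [-1, 1]]

feature_map = relu(convolve2d(image, kernel))
pooled = max_pool(feature_map)
pooled_shifted = max_pool(relu(convolve2d(shifted, kernel)))
```

In this toy case the feature map responds only where the edge sits, while `pooled` and `pooled_shifted` are identical: pooling has absorbed the one-pixel translation, which is the variance reduction the text describes.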

Figure 2. Materialization of a convolutional network and its analytical flow. Adapted from LeCun et al. [15] and Rawat et al. [16]

Next, the layers are stacked, each on top of the previous one, to extract further features. The final, fully connected layers are trained extensively through the backpropagation mechanism and output a predicted value (category or class).

CNN using Orange

The chromatogram images were loaded into the Orange software, where the InceptionV3 CNN model was used for image embedding (dimension reduction) and image processing by deep learning. [47,48] InceptionV3 is a CNN model that was trained on more than one million images, and Orange can reuse this InceptionV3 knowledge when working with new image types (transfer learning). Transfer learning with InceptionV3 is important for datasets with few samples, since CNN works better with larger datasets. [2,28,49,50] The deep learning processing via CNN determines the weights and feature maps of the images by finding patterns and creating filters from the training images (81% of the images). Next, ML algorithms (standard neural networks, logistic regression, decision tree, naive Bayes, and random forest) were employed to classify the embedded images, and their results were compared with each other. The algorithm with the best accuracy was used to generate a prediction model for the test samples (19% of the images). In the test, the model was applied to samples it had not been trained on, which revealed the actual efficiency of the technique for image classification. The complete flowchart of the deep learning analysis through CNN of GC-imaged data can be seen in Figure 3.

Figure 3. Complete flowchart of image analysis in the Orange software. (a) input image; (b) convolutional calculations; (c) separation of test samples and training samples; (d) sample training with five algorithms; (e) testing of the best model; and (f) output class.

A total of 221 whole-oil images (chromatograms) in JPEG format from GC analysis were used and tested. These data include oils from foreign basins (East Venezuela, Lusitanian, and Lower Congo, among others); however, the vast majority belong to Brazilian basins (Campos, Santos, Recôncavo, and Potiguar, among others). The samples were previously classified as either biodegraded or non-biodegraded (Figure 4 and Table 1). However, some samples were purposely misclassified as biodegraded (they are not actually biodegraded) in order to evaluate how efficiently the classification model copes with labeling mistakes in the training stage.

Figure 4. Chromatogram images used in the analysis and their pre-training classification. Figures (a) and (b) are chromatograms of biodegraded oil samples. Figure (a) presents the loss of light compounds (the peaks have a smaller carbon number than nC₁₆). Figure (b) shows the total loss of light compounds in addition to the rise of UCM. Figures (c) and (d) are chromatograms of non-biodegraded oil samples.

Table 1. Number of images used and original classification
Biodegraded	Non-biodegraded
92	129

The data were processed by CNN, which measured the training images (180 images) and created specific filters for each category. Next, the image classifier was trained using the results calculated by the CNN to create a robust prediction model for chromatogram images of biodegraded oils. There is a moderate difference in the number of images for each class; nevertheless, in the test stage, the samples were stratified to avoid any bias in the model. Building this model required finding the algorithm that would present the best result (accuracy).
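A stratified split of this kind can be sketched as follows. This is only an illustration of the idea, not the procedure implemented in Orange: the function name, the random seed, and the proportional rounding rule are assumptions. The per-class test counts it produces follow from that rounding rule and are not figures reported in the article.

```python
import random

def stratified_split(labels, n_test, seed=0):
    """Hold out n_test items while keeping each class's share roughly unchanged."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Proportional allocation; the rounding may need adjusting for other datasets.
        k = round(len(indices) * n_test / len(labels))
        test.extend(indices[:k])
        train.extend(indices[k:])
    return sorted(train), sorted(test)

# Class counts from Table 1; the order of the images here is a placeholder.
labels = ["biodegraded"] * 92 + ["non-biodegraded"] * 129
train_idx, test_idx = stratified_split(labels, n_test=41)

# Number of held-out images per class under this sketch's rounding rule.
test_counts = {c: sum(labels[i] == c for i in test_idx) for c in set(labels)}
```

With 221 images and 41 held out, the split leaves 180 images for training, matching the 81%/19% proportions described above, and both subsets keep roughly the same ratio of biodegraded to non-biodegraded images as the full set.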

Results

The Naive Bayes, Neural Network, Random Forest, Decision Tree, and Logistic Regression algorithms were chosen to test the classification of images (Table 2). Neural Network presented the best classification result: although its area under the curve (AUC) equaled that of Logistic Regression (both 99.7%), it had the highest classification accuracy (CA) of all the algorithms, at 96.7%, followed by Logistic Regression at 96.1%. Among the six samples that were misclassified, four show mild biodegradation with the loss of light compounds (<nC₁₆) or a slight rise in UCM (Figure 5).

Table 2. Classification training results for the five ML algorithms. Note that the Neural Network algorithm presented the highest classification accuracy (CA) of the group, followed by Logistic Regression.
Model	AUC	CA
Decision Tree	0.889	0.928
Random Forest	0.973	0.939
Neural Network	0.997	0.967
Naive Bayes	0.940	0.939
Logistic Regression	0.997	0.961
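The model selection can be restated as a short calculation over the values in Table 2. This is a sketch: ranking by CA follows the text, whereas using AUC to break ties is an assumption, as is the use of the 180 training images as the denominator when converting CA into a count of misclassified images.

```python
# Training scores copied from Table 2: model -> (AUC, CA).
results = {
    "Decision Tree": (0.889, 0.928),
    "Random Forest": (0.973, 0.939),
    "Neural Network": (0.997, 0.967),
    "Naive Bayes": (0.940, 0.939),
    "Logistic Regression": (0.997, 0.961),
}

# Rank by CA first; using AUC to break ties is an assumption of this sketch.
ranking = sorted(results, key=lambda m: (results[m][1], results[m][0]), reverse=True)
best = ranking[0]

# With 180 training images, a CA of 0.967 implies about six misclassified images.
n_train = 180
misclassified = round((1 - results[best][1]) * n_train)
```

Under these assumptions `best` is the Neural Network model, and the implied number of misclassified training images is six, consistent with the six samples discussed in the text.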

Figure 5. Misclassified samples in the training step. Figures (a) and (d) represent non-biodegraded oils; however, CNN classified them as biodegraded. Note the small parabola in the region of the lighter compounds, which is related to the original composition of the organic matter and may have misled CNN analysis for figures (a) and (d). Figures (b) and (c) represent biodegraded oils; however, CNN classified (c) as non-biodegraded. The loss of lighter compounds indicates only slight biodegradation, which may have misled CNN analysis for figure (c). Figure (b) was purposely misclassified as non-biodegraded in the training step; however, CNN classified it as biodegraded.

Once the prediction model was established, the next step was to test it by processing and classifying 41 images that had not yet been classified. The test result (Table 3) shows an AUC of 99.7% and an accuracy of 97.6%, which is even better than the training result. The confusion matrix of the test samples indicates that only one sample was misclassified; however, this sample shows characteristic elements of contamination by drilling fluid, such as a prominent peak in the nC₁₃ to nC₁₇ compounds (Figure 6a). [5] A mixture of severely biodegraded oil (note the 25-norhopane peak in Figure 6b) and drilling fluid yields a chromatogram with no distinguishable elements of biodegradation. Therefore, the test’s prediction error is actually a hit (Figure 6).
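The two metrics reported here can be computed as in the sketch below. The accuracy example uses the reported totals (41 test images, one error); the class balance of the label vector and the six scored samples used for the AUC are hypothetical and serve only to show the formulas. The function names are likewise illustrative.

```python
def classification_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class matches the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def roc_auc(y_true, scores, positive=1):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# 41 test images with one error, as reported. The class balance and the
# position of the misclassified image are placeholders.
y_true = [1] * 20 + [0] * 21
y_pred = list(y_true)
y_pred[0] = 0
test_ca = classification_accuracy(y_true, y_pred)  # 40 correct out of 41

# Hypothetical classifier scores for a six-sample toy set, to show the AUC formula.
toy_true = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]
toy_auc = roc_auc(toy_true, toy_scores)  # 8 of the 9 positive/negative pairs are ordered correctly
```

Forty correct predictions out of 41 give the reported CA of 97.6%. The AUC depends on how the classifier ranks the samples rather than on a single decision threshold, which is why a model can have a near-perfect AUC and still misclassify a sample.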

References

Notes

This presentation is faithful to the original, with only a few minor changes to presentation, spelling, and grammar. In some cases important information was missing from the references, and that information was added. The footnote at the end of the original version was turned into a formal citation for this version.