data science Archives - Faro Analytics

September 3, 2019

Short Script for Cumulative Production Calculations

One minute explanation of this post…

So, we are moving the office to Boston and have been incredibly busy the past few weeks – and will probably continue to be busy for a few more. I have been remiss in posting, so I am passing along this Python script to help you calculate all your cumulative production numbers over time periods you specify. The coding is very straight forward and it also shows you a way to reshape a data frame and rename columns. The code, as always, is below and in our Github repository HERE.

Where can you use this?

Calculating cumulative production is a great way to check/back up your decline forecast results, but in our group we mostly use it for machine learning algorithms related to figuring out EURs, predicting production histories, and diagnosing potential production problems.

Some Sundry Data About the Permian in NM

We have been working quite a bit with New Mexico data and thought we would share some statistics with our readers. One piece of code we don’t share is our decline forecasting algorithm. Like every other analytics group out there, ours is the best you will find. Along with that, to answer a common question, “No, you don’t need to use machine learning to build your own.” If you are experienced and have a set procedure that you use, you can build those rules into a function – probably a very large function. The Numpy Python library offers a good deal of data fitting options, and if combined with good logic and some ingenuity using other freely available libraries, you can create some incredibly powerful software.

The hexbin graphs below show the distributions of oil, gas, and water declines vs. b-values by formation in Eddy and Lea counties, NM. The well set reflects horizontals drilled in the last 5 years with at least 12 months of consistent production data. In terms of b-values, we try to stay in line with optimistic auditor calls and max out at 1.6.

Code for Cumulative Production Script

August 13, 2019August 13, 2019

Free Well Header and Production Data for Wyoming

This will be the last “free data” post for a while as we are sure everyone who frequents our blog wants to see other uses for Python in the oil and gas industry – we may even do other sectors coming up, so keep checking back. We forgot to include Wyoming in our previous post regarding where to find cheap public oil and gas data sets for the U.S. – and we think it would be remiss not to include the great state of Wyoming on our list considering how much free, high quality data they offer on their site. So, lets get started…

Where to Get It and What They Have

The main page for the Wyoming data sources is HERE. You can see they are definitely not stingy with the data sharing. Outside of the production and header data we will show you how to pull and reshape, they give you access to an incredibly large amount of information for a state website. One link in particular stands out to us for future analysis and that is their gas plant data. It looks like it is only in aggregate at the state level, but they drill down and provide you with links to their individual plant data (in Excel format). Incredibly helpful and insightful.

For the data we will be looking at, go HERE. It looks like a site address that may change from computer to computer, so you can also get to the same thing we are looking at by following the “download” link on that initial list page we linked to at the beginning of this section. Click it and you will come to this page:

You should separately select the two items in the menu we have grayed out on the image. Select one, click the cowboy to the left of the menu, then do the same for the other file. Once you download those two zipped files and perform the extraction, you will have 2 DBF files (well header data for wells that have been permanently P&A’d along with the same data for all other well statuses) and 4 Excel files (monthly production broken out by groups of counties).

Basic Process to Reshape the files

This will be a very short post because, as usual, we provide you the code below and on our GitHub repository – and that code is very well detailed as to what is happening at each step. The one thing we do want to mention is that along with the standard data that comes in the well header data, we also show a way to get a good estimate on lateral lengths using geopy and the lat/longs for surface and bottom hole locations. If you follow how we have it set up, you can apply this to every other data set you may have where you have the lat/longs, but not necessarily the actual lateral lengths for horizontal wells.

Final Note

We did a little cleaning on this, but, like everything else you find on the internet, you will need to do some editing yourself. You probably have a much smaller area of interest that you will want to dig down into and make sure is as clean as possible for whatever project you are working on. Though, in the meantime, you now have a way to get a large amount of what you would need to do evaluations on wells in Wyoming – ~164,000 wells and 17 million+ lines of production values.

Code

GitHub repository HERE.

July 24, 2019July 24, 2019

Pulling, Cleaning, and Using Data from FracFocus

So, you downloaded the FracFocus database – now what the hell do you do with it? In this post we will show you where to get it, how to load it , clue you into some general issues with working in it, show you a Python program that cleans it and gets the pertinent information, and then we will close with showing you some insights from it via Tableau. We won’t lie, we were going to show you how to build some prettier graphs and maps in Python, but my Geopandas library is acting up because I decided to update a library which will require some other updated libraries.

Where Do You Get It and How Do You Load It?

You can pick up the database HERE. They give you a query to connect all the tables in Microsoft SQL Server Management Studio on this page, as well. Note: we aren’t going to do anything with MS products minus load the FracFocus db into it. Though, you will use that query they give you in the Python program. Also, if you don’t have Management Studio, or other software that reads MS databases, it is easy to find an installation and instructions for setting it up.

To load it: select Databases in the Object Explorer, right click it, and select Restore Database. Select the Device radio button on the Restore database screen and navigate to where you have put the FracFocus.bak file and select OK. It will load the database and you are done with SSMS.

Loading the Database Into Python and Viewing the Data

We are posting the whole program below and on our GitHub repository. So, to get a better idea of what is going on, see the program. You will need the pyodbc library so the script can connect with MS SQL directly and pull the data from the database. The most challenging part of getting this to work is making sure you have the correct driver. You can see how we used the query they provide and give you a note on how to change the query to only view certain states and counties if you don’t need the whole database. One issue that you will come across after you load it is the fact that multiple columns have the same names and it makes it difficult to reference a certain version of the column you will want. We have included a function to rename those duplicated columns so you can erase them or use them as you see fit.

Data Quality

To put it succinctly: It is absolutely terrible. This is one of those data sources that if you want to practice your data cleaning skills, this is a great opportunity. We think we have given you a good 80% start on the task, but you can spend a lot more time going through this with a fine tooth comb. Perhaps you can get a few thousand more wells cleaned up to give you a better data pool. Going through the mammoth list of ingredients and purposes, you find out that there is, indeed, 200 different ways to spell “naphthalene” or “ethylenediamine triacetic acid”. You also get an idea of the attitude of the people entering the data. For example:

“Aquafina”: We really don’t care how we spend our sponsor’s money…

“Dihydrogen Monoxide”: We just want to poison the Earth.

“Essential Oils”: Optimizes a frac job while aligning your chakras.

“Pee”: We are straight up honest.

“Contains hazardous substances in high concentrations”: Their lack of the word “No” has unknowingly given them a very “Come at me, bro” attitude toward the EPA.

“Contains no hazardous substances” all the way down that well’s ingredient list: Reminds me of this clip from Super Troopers…”Don’t worry about that little guy.”

One thing you can do to help out any searching you will do is eliminate \t (tab) and \n (newline) tags. Also, converting everything to upper case can make searches or other cleaning methods a little easier. And, of course, the universal issue when talking about proportions of anything – percents that are represented as whole numbers and decimals. You will most definitely need to find which ones are which and standardize them.

Break Down Between Fluid and Proppant

We have run the cleaning on the database two different ways. Use the total fluid value for each well and subtract that from 100, with the remaining percentage being proppant, or the reverse of that. We use the reverse; 100% – proppant percentage as we have had better results. It is up to you, but the code provided calculates percentages our way. The one thing we would change, if you plan on using this, is replace our proppant filter list with a text search using regex (regular expressions) to streamline this. We use a list to filter proppants here because it was easier for us – we keep that list along with various categories of fluids, citric acid use (for Permian well studies), and other sundry ingredients for analysis. There is no better time saver than cut and paste.

Post Cleaning and Results

After you have cleaned everything to your desired level, and have eliminated outliers using statistical methods, you have a relatively decent oil and gas data set. In a later post, we will show you how to use geopandas to plot maps in python and some more presentation worthy visualizations, but for now I am sure you are fine with seeing this in Tableau format on the Tableau Public site. If you have never used it, as long as you share the data, you can build workbooks for free.

Tableau Results

For the full Tableau workbook in full screen mode, go HERE.

Python Program Code

For the full Python program code, go HERE.

July 16, 2019July 16, 2019

How to Get Free Oil and Gas Data for One of the Most Active Basins

When you start searching the internet for oil and gas data, the first thing you will probably come across are sites for expensive data services. For small consultant groups and companies, this can be quite a dent in a budget. Why should you pay for data that is easily available when that money can be spent on buying interests or actual operating expenses? A lot of what you are looking for can be acquired for free or at a really low price. In this post we will give you the links to some of these sites and show you the way to get at some data that is very useful right now.

State Sites

California – Great amount of data stored in Access databases. They have a search capability for one-off wells, but we think what you really want is found HERE. Unfortunately, you will have to combine the databases via a SQL query, but that is a small amount of effort when it comes to the fact that everything the state has is free. In a future post we will show you how to take data from SQL and work with it, much faster, in Python. Of note for California, watch when wells switch from injection to production and vice versa in the database production table (due to steam floods).

Pennsylvania – Not bad, the state finally started keeping track of production by well, monthly, in April of 2015. This is where you can collect all sorts of PA data, but, more specifically, production in an Excel format HERE. Be aware of the awful data quality prior to the aforementioned time. Select the data you want and download it, en masse.

Ohio – Ugh, Ohio. Quarterly data for unconventionals, but at least it is allocated by well. Many of the older wells are single values by year. A way we think about how to get the unconventional quarterly to a monthly format is to look at investor presentations for the operators you can find and see who chokes their wells for a flat initial monthly profile (number of months) and use that with general b values and declines from older wells to calculate the breakdown. There is also a well database setup available – the intent is good, the execution is horrible. The database tool uses an old Access database version and their own queries throw errors if you have a more recent version of the program. You can still get at the data, though. Open it ignoring the errors and copy it into a usable format. This is an automation project we will most likely do in the near future.

North Dakota – $175. $175 and you are able to get all of the state data for an entire year. Way to go ND. You probably pay more on a monthly basis for some Adobe products.

Montana – Not much going on oil and gas wise in the state, except it being the western edge of the Williston. The search is HERE.

Louisiana – The opposite of ND – charges a ridiculous amount for one state in terms of a mass download. The well search is HERE.

Texas and Oklahoma – This is 4 or 5 posts in itself. The Texas Railroad Commission allocates by lease and the search functionality is very unfriendly, though you can get some decent data from University Lands for a large number of wells. You can buy the state data and write an allocation program yourself, too – and the pricing from the state isn’t terrible. In terms of Oklahoma… Well, there is some data that is easily pulled, but you realize what you are in for when their front page says, “Oil production has not been tracked by the Commission for several years, and is not available from our database. ” Because, you know, why bother keeping track of oil? It is possible to get this data and make it work, but, again, it is a little more involved than just a few posts.

New Mexico Data

So we said at the beginning of the article that we would show you how to go about getting data for an area that is popular right now. New Mexico’s Eddy and Lea counties are in the northwest portion of the Permian Basin – a large amount of the Delaware basin. We are going to show you where to get all of it at once (and we mean ALL of the wells…in the state), how to view it, and how to work with it by giving you a small example that will pull all of the effective lateral lengths with their corresponding APIs and some pointers on how to clean the data before using it. Also, keep in mind, NM updates their production quarterly.

Where to Find the New Mexico Data

To start, if you are looking for just a single well or maybe a few wells, you should use their well search . Otherwise, to download all of it, go HERE and follow the link to the FTP server. From there, select the OCD Data folder. The OCD_DataDictionary.xls file will really help with trying to figure out what data you would want from the proper file. The T_WC_VOL (production) and OCDExport (other general data) will be the zip files that contain the bulk of the information. There is a lot of good data in most of the FTP folders, but the two files mentioned are going to be key.

The Not-So-Easy Part

All of the files you will download, minus the data dictionary, are XML files. There are many other formats I prefer over XML, and I like to joke that it is “literally the least you could do” in getting data out there in a usable form. If you have an XML reader, great. If not, you will probably try to open it with some other program – perhaps using notepad, some browser, or any application that converts something to text. Seeing as how a majority of these files are larger than 1 GB in size, you may be waiting a while for your request to complete – if your computer just doesn’t crash trying to handle it. There is an incredibly easy way to view, parse, and use these files in Python.

Program Description

The program requires 3 libraries: Pandas, Numpy, and ElementTree. If you go on to cleaning the data, you will probably also want to import Matplotlib.pyplot to build some histograms to see where data has to be culled or find outliers. After the import, it only takes 2 lines of script to view some top portion of the file and get an idea of how it is structured. You can replace the ‘xxx’ portion of the xmlf.read(xxx) statement to view as many characters as you want until you have a good understanding of the patterns. The two pickle statements that are commented out below the main data acquisition loop allow you to save or read the final file from the loop. I usually keep those two lines in a program if something takes a bit of time to run so I have that portion saved and don’t have to rerun that part. By the time the looping is done, you will have data for 122,470 wells in terms of IDs and perf footage values. After that, the program combines the identification values to get full API numbers and subtracts the perf values to get effective lateral lengths. The full program can be found HERE.

Tips

I would suggest that if you use this, you do some cleaning on the data. This is just a basic example to show how to get the data and give some pointers on what to watch out for. Speaking of pointers, when you sort the data by the perf locations and distances, you will find that either companies are drilling 99,999′, 9,999′, and 9′ wells incredibly often or that they are able to drill in negative feet. Different people will clean the data in different ways. Most up for debate would be what a user would call a cut off for the minimum and maximum lengths of a lateral in this part of the world. A good way to tie all of this together is to use this data extract script to get more qualitative well data and figure out which wells are horizontals in the first place, where you have duplicate wells and take the greater/lesser of the two values, and if that general well data you just pulled to clean this well data is, itself, clean. Data set secret: NM has a lot of wells in their database tagged as verticals that suspiciously end with a number and the letter ‘H’ and were drilled in the last 3-5 years, so heads up.

July 10, 2019July 23, 2019

A Script for Pulling Oil and Gas Prices

This small price fetching script comes from a much larger program in Faro’s library which allows a company to get a total view of a basin’s metrics in an incredibly short time frame. The program will:

Clean all the production, geologic, and sundry data that it is fed.
Create other metrics that can be calculated from government / state data, the FracFocus database, and other sources.
Normalize and high-point align production data for all wells.
Use our proprietary curve fitting algorithm to automate fitting all production streams for wells over 12 months old.
Utilizing the metrics from those forecasts, along with the general well data, use machine learning algorithms to properly classify the younger vintage wells and predict type curve areas for new locations.
Use best estimate capital costs and lease operating expenses to forecast all well economics – and if the data is available, perform look backs.
Generate geologic, production, EUR, and IRR maps (in both Python and other mapping software).

About the Program and Price Sites

Before you use this script, you will have to install a webdriver for the web browser you prefer (if you don’t already have the driver). In this case, we use Firefox. Chrome is great, but when it comes to looking at what is going on in the background of a website we prefer Firefox’s tools. So, to get GeckoDriver for Firefox, go to this page.

Typically, when you are looking at a site from which to pull data, you can get an idea of the table format by just right clicking anywhere on the page and choosing to view the page source. For example, if you go to the EIA page for Cushing, OK WTI daily historic prices and view the page and its source, you will see:

*This is straight forward – The table is on the source page with a title (table SUMMARY), individual tables are* *tagged with “tr”, and the individual cells are tagged with “td”. This is very easy for a parser to pick up and process.*

Getting the Data

In the case of the CME site, when you try to view the source, no tables or tags will pop up, but there is an easy work around to get to the originating data url. Go to the CME site in your Firefox browser and in the upper right hand corner there should be a drop down menu. From there, go to Web Developer and select Network. There are some options at the top of that window – “All HTML CSS JS XHR…”. Pick XHR and reload the page. You will see one code in particular that keeps popping up with a long, changing number after the refresh. That is going to be the site with the actual prices. Right click it, open up the link, and select “Raw Data” at the top of that page. That is all of the data which you will parse out using the descriptors in front of the values you want.

For clarification, that long number at the end of the link you follow is Unix time (number of seconds since Jan. 01, 1970). It will play a part in the program by using a version of that number to create a url.

Jupyter Notebook Script

We have posted the full code for pulling both the CME prices and monthly historic prices from the EIA website on our github repository HERE. We only show oil in this example, but it is just a matter of changing the urls and switching your variable names to reflect gas.

Additions You May Want to Make

One issue with the data you get is that there could be a break in the months between historic and futures prices. The EIA is slow to update history, and futures prices move quick. We account for this in our main program with some other pulls, but you can easily add two months with your estimated/known prices to the bottom of the resulting dataframe and re-sort it if you decide to put them together.

Interested in Learning Python?

If you are new to python or are interested in getting started on it, one of the more popular distributions is Anaconda. It is free and, in our opinion, more friendly than other python distributions when it comes to set up, managing packages, and building environments. You can download it here and will be able to set up your own Jupyter Notebook IDE.