Pulling, Cleaning, and Using Data from FracFocus

So, you downloaded the FracFocus database. Now what the hell do you do with it? In this post we will show you where to get it and how to load it, clue you in to some general issues with working in it, walk through a Python program that cleans it and pulls out the pertinent information, and close with some insights from it via Tableau. We won’t lie: we were going to show you how to build some prettier graphs and maps in Python, but our Geopandas install is acting up after a library update that now requires updating several other libraries.

Where Do You Get It and How Do You Load It?

You can pick up the database HERE. On the same page they also give you a query that connects all the tables in Microsoft SQL Server Management Studio (SSMS). Note: we aren’t going to do anything with MS products other than load the FracFocus db into SQL Server, though you will reuse that query in the Python program. Also, if you don’t have Management Studio, or other software that reads MS databases, it is easy to find an installer and instructions for setting it up.

To load it: right-click Databases in the Object Explorer and select Restore Database. Select the Device radio button on the Restore Database screen, navigate to where you put the FracFocus.bak file, and select OK. It will load the database, and you are done with SSMS.

Loading the Database Into Python and Viewing the Data

We are posting the whole program below and on our GitHub repository, so see the program for a better idea of what is going on. You will need the pyodbc library so the script can connect to MS SQL Server directly and pull the data from the database. The most challenging part of getting this to work is making sure you have the correct ODBC driver. In the program you can see how we used the query they provide, along with a note on how to change it to pull only certain states and counties if you don’t need the whole database. One issue you will come across after you load it is that multiple columns share the same name, which makes it difficult to reference the specific version of a column you want. We have included a function to rename those duplicated columns so you can drop them or use them as you see fit.
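To give a feel for the two pieces described above (connecting through pyodbc and renaming duplicated columns), here is a minimal stand-alone sketch. It is not an excerpt from the full program. The server name, database name, and driver name are assumptions you will need to swap for your own setup, and the function names are made up for illustration.

```python
from collections import Counter


def build_connection_string(server="localhost", database="FracFocus",
                            driver="ODBC Driver 17 for SQL Server"):
    """Build an ODBC connection string that uses Windows authentication.

    All three defaults are assumptions; use the names from your own install.
    """
    return (
        f"DRIVER={{{driver}}};"
        f"SERVER={server};"
        f"DATABASE={database};"
        "Trusted_Connection=yes;"
    )


def dedupe_column_names(names):
    """Append _2, _3, ... to repeated column names so each one is unique."""
    seen = Counter()
    unique = []
    for name in names:
        seen[name] += 1
        unique.append(name if seen[name] == 1 else f"{name}_{seen[name]}")
    return unique


def load_rows(query, connection_string):
    """Run the query and return (unique_column_names, rows)."""
    import pyodbc  # imported here so the helpers above work without it

    conn = pyodbc.connect(connection_string)
    try:
        cursor = conn.cursor()
        cursor.execute(query)
        # cursor.description holds one tuple per column; item 0 is the name.
        columns = dedupe_column_names([col[0] for col in cursor.description])
        return columns, cursor.fetchall()
    finally:
        conn.close()
```

If the connection fails, `pyodbc.drivers()` lists the ODBC drivers installed on your machine, which is usually the quickest way to find the exact driver name to use.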

Data Quality

To put it succinctly: it is absolutely terrible. If you want to practice your data cleaning skills, this data source is a great opportunity. We think we have given you a good 80% start on the task, but you can spend a lot more time going through it with a fine-tooth comb. Perhaps you can get a few thousand more wells cleaned up to give you a better data pool. Going through the mammoth list of ingredients and purposes, you find out that there are, indeed, 200 different ways to spell “naphthalene” or “ethylenediamine triacetic acid”. You also get an idea of the attitude of the people entering the data. For example:

“Aquafina”: We really don’t care how we spend our sponsor’s money…

“Dihydrogen Monoxide”: We just want to poison the Earth.

“Essential Oils”: Optimizes a frac job while aligning your chakras.

“Pee”: We are straight up honest.

“Contains hazardous substances in high concentrations”: Their lack of the word “No” has unknowingly given them a very “Come at me, bro” attitude toward the EPA.

“Contains no hazardous substances” all the way down that well’s ingredient list: Reminds us of this clip from Super Troopers: “Don’t worry about that little guy.”
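Jokes aside, the spelling variants are where most of the cleanup time goes. One way to chip away at them is fuzzy matching against a short list of canonical names. The sketch below uses `difflib` from the Python standard library. It is an illustration rather than what the full program does, and both the canonical list and the 0.85 cutoff are assumptions you would tune against your own data.

```python
from difflib import get_close_matches

# Hypothetical canonical names; build your own list from the data.
CANONICAL = ["NAPHTHALENE", "ETHYLENE GLYCOL", "METHANOL"]


def canonical_name(raw, choices=CANONICAL, cutoff=0.85):
    """Return the closest canonical spelling, or None if nothing is close."""
    matches = get_close_matches(raw.strip().upper(), choices, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

A high cutoff keeps “Aquafina” from being mapped onto a real chemical, at the cost of leaving the more creative misspellings for a manual pass.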

One thing you can do to help out any searching is strip the \t (tab) and \n (newline) characters. Converting everything to upper case can also make searches and other cleaning methods a little easier. And, of course, there is the universal issue with proportions of anything: some percentages are entered as whole numbers (25 for 25%) and others as decimals (0.25). You will most definitely need to find which ones are which and standardize them.
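Both cleanup steps fit in a few lines. This is a rough sketch, not the logic from the full program. The percent check in particular rests on a loud assumption: if one well’s values add up to about 1 instead of about 100, they were entered as fractions. The 1.5 cutoff is arbitrary, and wells with partial ingredient lists will fool it, so spot-check the results.

```python
import re

_WHITESPACE = re.compile(r"\s+")


def normalize_text(value):
    """Collapse tabs, newlines, and repeated spaces, then upper-case."""
    if value is None:
        return ""
    return _WHITESPACE.sub(" ", str(value)).strip().upper()


def to_percent(values, fraction_total_cutoff=1.5):
    """Return one well's concentrations on a 0-100 scale.

    Assumption: a total at or below the cutoff means the values were
    entered as fractions (0.25) rather than percentages (25).
    """
    total = sum(values)
    if 0 < total <= fraction_total_cutoff:
        return [v * 100 for v in values]
    return list(values)
```

Running `normalize_text` on ingredient and purpose fields before any searching means a single upper-case search term catches every casing and whitespace variant.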

Break Down Between Fluid and Proppant

We have run the cleaning on the database two different ways. One is to take the total fluid percentage for each well and subtract it from 100, with the remainder being proppant; the other is the reverse. We use the reverse, 100% minus the proppant percentage, as we have had better results with it. It is up to you, but the code provided calculates percentages our way. The one thing we would change, if you plan on using this, is to replace our proppant filter list with a text search using regex (regular expressions) to streamline it. We use a list to filter proppants here because it was easier for us: we keep that list along with various categories of fluids, citric acid use (for Permian well studies), and other sundry ingredients for analysis. There is no better time saver than cut and paste.
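Here is roughly what the regex version could look like. The keywords in the pattern are illustrative guesses, not our actual filter list, and the input is assumed to be one well’s ingredients as (name or purpose, percent of the job) pairs that are already on a 0-100 scale.

```python
import re

# Illustrative keywords only; extend or trim after checking your own data.
PROPPANT_PATTERN = re.compile(r"\b(PROPPANT|SAND|SILICA|QUARTZ|CERAMIC)\b")


def is_proppant(text):
    """True if the ingredient name or purpose looks like a proppant."""
    return bool(PROPPANT_PATTERN.search(text.upper()))


def fluid_and_proppant(ingredients):
    """Return (fluid_percent, proppant_percent) for one well.

    Fluid is whatever is left after the proppant share: 100 - proppant.
    """
    proppant = sum(pct for name, pct in ingredients if is_proppant(name))
    return 100.0 - proppant, proppant


well = [
    ("Water", 88.0),
    ("Crystalline Silica (Quartz)", 11.0),
    ("Hydrochloric Acid", 1.0),
]
fluid, proppant = fluid_and_proppant(well)
```

The word boundaries (`\b`) keep “SAND” from matching inside words like “SANDSTONE”. Silica also shows up in some non-proppant additives, so expect to add exclusions once you look at what the pattern catches.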

Post Cleaning and Results

After you have cleaned everything to your desired level and eliminated outliers using statistical methods, you have a relatively decent oil and gas data set. In a later post, we will show you how to use geopandas to plot maps in Python and build some more presentation-worthy visualizations, but for now we are sure you are fine with seeing this in Tableau format on the Tableau Public site. If you have never used it, you can build workbooks for free as long as you share the data.
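“Statistical methods” covers a lot of ground. As one common example, and not necessarily the method used in the full program, here is a sketch of the interquartile range (IQR) rule using only the standard library. The 1.5 multiplier is the conventional default, and the sample values are made up.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return (low, high) fences at k * IQR beyond the first and third quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


def drop_outliers(values, k=1.5):
    """Keep only the values that fall inside the IQR fences."""
    low, high = iqr_bounds(values, k)
    return [v for v in values if low <= v <= high]


# Made-up proppant percentages for a handful of wells; 400 is clearly bad data.
sample = [10, 11, 12, 12, 13, 14, 15, 400]
cleaned = drop_outliers(sample)
```

Run the filter per basin or per formation rather than on the whole database at once, since a normal value in one play can look like an outlier in another.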

Tableau Results

For the full Tableau workbook in full screen mode, go HERE.

Python Program Code

For the full Python program code, go HERE.