As Data Scientists, our code is typically used for ad-hoc exploratory data analysis and model building, and as such can seem quite scrappy compared to the well-designed code written by our software engineer colleagues.
Nevertheless, it’s often the case that we find ourselves writing the same or very similar functions over and over again. Using Python, we can expedite our workflow by putting these functions in a module, for use in other programs.
In Python, whenever we use an import statement we are using modules, whether that is importing from one of Python’s built-in packages, a third-party package like pandas or numpy, or a module that we’ve written ourselves.
Python modules are simply text files with the .py file extension – typically containing variable and function definitions.
Here’s a quick example of how a module might be structured.
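For instance, a small letters.py module might contain the following (the file name and function bodies here are illustrative):

```python
# letters.py -- a small module of letter-related helpers.

vowels = ['a', 'e', 'i', 'o', 'u']


def check_if_letter(value):
    """Return True if value is a single alphabetic character."""
    return isinstance(value, str) and len(value) == 1 and value.isalpha()


def make_lower(value):
    """Return a lower-case copy of the input string."""
    return value.lower()
```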
As you can see, there’s not much difference between this and a normal Python script. However, now that we’ve defined our functions and variables in a module, we can import them directly into other programs (so long as our program is running in the same directory as our module).
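A script alongside the module could then use it like so (to keep this snippet self-contained it writes a stand-in letters.py first; the module and function names are illustrative):

```python
import os
import sys

# For the sake of a self-contained example, write a stand-in letters.py
# to the current directory (the module contents here are illustrative).
with open('letters.py', 'w') as f:
    f.write(
        "vowels = ['a', 'e', 'i', 'o', 'u']\n"
        "\n"
        "def check_if_letter(value):\n"
        "    return isinstance(value, str) and len(value) == 1 and value.isalpha()\n"
        "\n"
        "def make_lower(value):\n"
        "    return value.lower()\n"
    )
sys.path.insert(0, os.getcwd())  # make sure the current directory is importable

import letters  # the module's elements are accessed with dot notation

print(letters.check_if_letter('a'))  # True
print(letters.make_lower('A'))       # a
print(letters.vowels)                # ['a', 'e', 'i', 'o', 'u']
```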
Having to write the module’s name before calling one of its elements does seem a little clunky. We can instead specify the functions/variables we wish to use in our import statement, or import all of a module’s contents into our namespace by using import *.
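To illustrate the syntax with Python’s built-in math module:

```python
# Import specific names so no module prefix is needed:
from math import sqrt, pi

print(sqrt(16))  # 4.0
print(pi)

# Or import everything in the module into the current namespace:
from math import *

print(floor(2.7))  # 2
```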
Importing with * in particular is considered bad practice, as it can lead to confusion when imported names clash with other items in the global namespace.
It is possible, however, to reassign the name of a module, or of an element within it, to a local name using the as keyword.
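For example, using built-in modules to illustrate:

```python
# Alias a whole module...
import statistics as stats

print(stats.mean([1, 2, 3]))  # 2

# ...or rename a single element as we import it:
from math import factorial as fact

print(fact(5))  # 120
```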
Notice that the functions defined in our module are quite closely related to one another and would typically be performed on the same object.
If we have ‘L’ as an input, we might want to check that it is a letter and convert it to lower case. We may then want to add a new function that allows us to check whether our letter is a vowel or not.
In Python we use classes when we wish to create logical groupings of functions, variables and data. If, instead of a module, we created a class containing the functions above, it would look something like:
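A sketch of such a class (the method names mirror the hypothetical module functions above):

```python
class Letter:
    """Groups our letter-related data and functions together."""

    vowels = ['a', 'e', 'i', 'o', 'u']

    def __init__(self, letter):
        # Runs when a new Letter is created; stores the input on the instance.
        self.letter = letter

    def check_if_letter(self):
        return (isinstance(self.letter, str)
                and len(self.letter) == 1
                and self.letter.isalpha())

    def make_lower(self):
        return self.letter.lower()

    def check_if_vowel(self):
        # The new behaviour suggested above: is this letter a vowel?
        return self.letter.lower() in self.vowels
```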
At first glance this may seem more involved than defining a module, especially with the addition of the __init__ method and the self parameter.
The __init__ method runs whenever we create a new Letter instance – it’s where we set up the new object’s initial state.
self is how we refer to the current instance from within the class itself: when check_if_letter is called on a Letter object, self is that object.
Classes are similar to modules, in that we can access their elements using dot operators. The benefit of using classes over modules is that we are able to create multiple independent instances of our class.
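For example, using a cut-down version of the hypothetical Letter class:

```python
class Letter:
    vowels = ['a', 'e', 'i', 'o', 'u']

    def __init__(self, letter):
        self.letter = letter

    def check_if_vowel(self):
        return self.letter.lower() in self.vowels


a = Letter('a')
b = Letter('B')

print(a.check_if_vowel())  # True
print(b.check_if_vowel())  # False

# Changing one instance leaves the other untouched.
a.letter = 'e'
print(b.letter)  # B
```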
Both objects contain their own functions and variables inside of them – they are totally independent of one another.
In this toy example plain functions may well have been sufficient, but when, for example, we wish to model the properties of an asset or track the behaviour of a user, classes definitely come in handy.
An excellent introduction to Python classes can be found here.
Now that we’ve managed to structure our code for reusability, how do we use it from any directory on our machine, and how do we share it with others?
Fortunately, Python provides modules that allow you to package and distribute your code, namely setuptools and distutils.
The accepted practice for distributing a Python package is to structure it as follows:
- data_science_package – the name of the top-level directory doesn’t matter too much, but the usual practice is to give it the same name as the package that you wish to share
  - setup.py – this file contains information about the package, allowing easy installation of the package and its dependencies
  - data_science_package – the folder containing the actual code
    - __init__.py – this file tells Python that this directory is a package
    - letters.py – a file containing Python code
The setup.py file can be kept relatively simple, especially if the package is only being installed locally or shared amongst a small number of people.
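A minimal setup.py along these lines might look like this (the name and version here are illustrative):

```python
# setup.py
from setuptools import setup

setup(
    name='data_science_package',
    version='0.1',
    packages=['data_science_package'],  # the folder containing the code
)
```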
Now we navigate to that directory and, from the command line, enter ‘python setup.py install’ – the package will then be accessible from anywhere on that machine.
Note that if we type develop rather than install, any changes made to the module are automatically reflected in our package, which saves us having to reinstall each time we make a change.
As our package grows and becomes more complicated, the directory structure might change to include sub-packages. We could list each and every package by hand, but in the interest of brevity setuptools provides a find_packages function that does the hard work for us.
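With find_packages, the setup.py sketch becomes:

```python
# setup.py
from setuptools import setup, find_packages

setup(
    name='data_science_package',
    version='0.1',
    packages=find_packages(),  # discovers all packages and sub-packages for us
)
```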
Our package will most probably depend on other modules and packages too. The setup function takes an argument called install_requires, which allows us to pass the package’s dependencies as a list of strings.
By placing a relational operator next to the name of each dependency we can specify a version constraint, e.g. numpy version 1.8 or greater.
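Putting this together, a sketch might look like the following (the dependencies listed are illustrative):

```python
# setup.py
from setuptools import setup, find_packages

setup(
    name='data_science_package',
    version='0.1',
    packages=find_packages(),
    install_requires=[
        'pandas',
        'numpy>=1.8',  # numpy version 1.8 or greater
    ],
)
```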
Upon installation, any prerequisites not already found on the machine will be downloaded automatically.
If the package is to be shared commercially (or even open sourced) so that it can be used by other data scientists, then it needs to contain some package metadata. A complete list of metadata arguments that can be given to the setup function is available here, and a more detailed guide to packaging Python distributions is available at this link.
Hopefully, this blog post helps Data Scientists out there write cleaner, more modular code, speed up their workflow – and save their code from being mocked by software engineers!