Here are 20 Python packages you should know for your data science, data engineering, and machine learning projects. These are the packages that machine learning engineer Sandro Luck recently listed as the most useful during his career as an engineer and Python programmer.
1. OpenCV
The open-source computer vision library OpenCV is your best friend when it comes to images and videos. It offers efficient solutions to common image problems such as face detection and object detection. If you plan to work with images in data science, this library is a must.
Data visualization is your main way to communicate with non-data wizards. If you think about it, even apps are merely a way to visualize various data interactions behind the scenes. Matplotlib is the basis of visualization in Python: from visualizing your edge detection algorithm to looking at distributions in your data, Matplotlib is your partner in crime.
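As a rough illustration of "looking at distributions in your data", here is a minimal sketch that plots a histogram. The data is synthetic, and the seed, bin count, labels, and output file name are all assumptions made for this example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data, invented for illustration: 1,000 draws from a normal distribution.
rng = np.random.default_rng(seed=0)
samples = rng.normal(loc=0.0, scale=1.0, size=1000)

# Draw the histogram and label the axes.
fig, ax = plt.subplots()
counts, bin_edges, _ = ax.hist(samples, bins=20)
ax.set_xlabel("value")
ax.set_ylabel("count")
ax.set_title("Distribution of 1,000 synthetic samples")

# Write the figure to disk (hypothetical file name).
fig.savefig("distribution.png")
```

In a Jupyter notebook you would typically skip the backend line and the figure would render inline.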
Given that we are talking about Python packages, we have to take a moment to talk about their master, pip. Without it, you would struggle to install any of the others. Its only purpose is to install packages from the Python Package Index or from places such as GitHub. But you can also use it to install your own custom-built packages.
Python wouldn’t be the most popular programming language without NumPy. It is the foundation of nearly all data science and machine learning packages and an essential package for math-intensive computations in Python. All that nasty linear algebra and fancy math you learned in university is basically handled by NumPy in a very efficient way. Its syntax style can be seen in many of the important data libraries.
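To make the "linear algebra handled efficiently" point concrete, here is a small sketch that solves a linear system and applies an operation to a whole array at once. The matrix and vector values are made up for the example.

```python
import numpy as np

# Solve the linear system A @ x = b without writing any loops.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = np.linalg.solve(A, b)  # x is approximately [2.0, 3.0]

# Vectorized arithmetic: the operation is applied to every element at once.
squares = np.arange(5) ** 2  # 0, 1, 4, 9, 16

print(x, squares)
```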
Built mostly on NumPy, pandas is the heart of all the data science you can ever do with Python. “import pandas as pd” opens up much more than Excel on steroids. Its declared goal is to become the most powerful open-source data tool available in any language, and maybe it is more than halfway there.
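A typical first step with pandas looks something like the sketch below: load tabular data into a DataFrame and aggregate it. The column names and numbers are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical sales data, invented for this example.
sales = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "revenue": [100, 80, 150, 120],
})

# Group the rows by region and sum the revenue within each group.
revenue_by_region = sales.groupby("region")["revenue"].sum()

print(revenue_by_region)
```

The result is a Series indexed by region, here 250 for "north" and 200 for "south".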
If you have ever worked with dates in Python, you know that doing it without dateutil is a pain. Given the current date, it can compute the date one month later, or the distance between two dates in seconds. Most importantly, it handles time zone issues for you, which can be a massive pain if you have ever tried doing it without a library.
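The sketch below shows the kinds of date operations described here. The dates and the "Europe/Zurich" zone are arbitrary choices for illustration; note that the difference in seconds uses the standard library's datetime subtraction on values parsed by dateutil.

```python
from datetime import datetime

from dateutil import parser, tz
from dateutil.relativedelta import relativedelta

# "One month later" from the end of January: relativedelta clamps
# to the last valid day of the target month.
start = datetime(2021, 1, 31)
next_month = start + relativedelta(months=1)  # 2021-02-28

# Parse loosely formatted date strings, then take their distance in seconds.
first = parser.parse("2021-01-01 00:00")
second = parser.parse("Jan 2, 2021")
seconds_apart = (second - first).total_seconds()  # one day = 86400.0 seconds

# Convert a UTC timestamp to a named time zone.
utc_noon = datetime(2021, 6, 1, 12, 0, tzinfo=tz.UTC)
zurich_time = utc_noon.astimezone(tz.gettz("Europe/Zurich"))

print(next_month, seconds_apart, zurich_time)
```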
If machine learning is your passion, the scikit-learn project has you covered. It is the best place to get started and also the first place to look for any algorithm you could possibly want to use for your predictions. It also features tons of handy evaluation methods and training helpers, such as grid search. Whatever predictions you are trying to get out of your data, sklearn will help you do it more efficiently.
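As an example of the grid search helper mentioned above, the following sketch tunes one hyperparameter of a k-nearest-neighbors classifier with cross-validation. The choice of dataset (the bundled iris data), model, and candidate values is an assumption made purely for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Small built-in dataset: 150 flowers, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# Try several values of n_neighbors, scoring each with 5-fold cross-validation.
param_grid = {"n_neighbors": [1, 3, 5, 7]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)

best_k = search.best_params_["n_neighbors"]
print(best_k, search.best_score_)
```

After fitting, `search` itself behaves like the best model found, so `search.predict(...)` can be used directly.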
This is kind of confusing, but there is a SciPy library and there is a SciPy stack. The stack includes NumPy, Matplotlib, IPython, and pandas. As with NumPy, you most probably won’t use the SciPy library itself, but the above-mentioned scikit-learn library relies heavily on it. SciPy provides the core mathematical methods behind complex machine learning processes.
If you ever wondered what my favourite Python package is, look no further: it’s this stupid little application called tqdm. All it really does is give you a progress bar that you can wrap around any for loop. It tells you how long each iteration takes on average and, most importantly, how much longer the loop will take, so you know exactly how long you can watch YouTube videos before you have to go back to work.
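Wrapping a loop really is a one-line change, as this minimal sketch suggests. The loop body, the sleep duration, and the description label are placeholders standing in for real work.

```python
import time

from tqdm import tqdm

total = 0
# Wrapping the iterable in tqdm() is the only change to the loop.
for i in tqdm(range(100), desc="Processing"):
    time.sleep(0.01)  # stand-in for real work
    total += i

print(total)
```

While the loop runs, the bar shows the completed fraction, the iteration rate, and an estimate of the remaining time.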
TensorFlow is the most popular deep learning framework and really what made Python what it is today. It is an entire end-to-end open-source machine learning platform that includes many more packages and tools, such as TensorBoard, Colab, and the What-If Tool. Chosen by many world-leading companies for their deep learning needs, TensorFlow is, with a staggering 159,000 stars on GitHub at the time of writing, the most popular Python package of all time. It is used for various deep learning use cases by companies such as Coca-Cola, Twitter, Intel, and its creator, Google.
Keras is a deep learning framework made for humans, as its slogan goes. It made rapid development of new neural networks a thing. Keras is built on top of TensorFlow and is really where developers start when they first experiment with a new architecture for their model. It lowered the entry barrier for programming neural networks so much that most high school students could do it by now.
PyTorch is TensorFlow’s main competitor in the deep learning space. It has become a great alternative for developing neural networks. Its community is a bit stronger in the realm of natural language processing, while TensorFlow tends to be a bit more on the image and video side. Just as TensorFlow has Keras, PyTorch has its own simplifying library, PyTorch Lightning.
Statsmodels, in contrast to the fancy new machine learning world, is your door to the classical world of statistics. It contains many helpful statistical evaluations and tests. These classical methods tend to be a lot more stable, and they are surely something any data scientist should use every now and then.
The big alternative to Matplotlib is Plotly, which is arguably more beautiful and far better for interactive data visualizations. The main difference from Matplotlib is that it is browser-based and slightly harder to start with, but once you understand the basics it is truly an amazing tool. Its strong integration with Jupyter makes me believe that it will become more and more standard and make people move away from Matplotlib.
NLTK, short for the Natural Language Toolkit, is your best friend when you are trying to make sense of any text. It contains extensive algorithms for various grammatical transformations such as stemming, along with handy lists of tokens that you might want to remove before processing text in your models, such as punctuation and stop words. It will also detect what is most likely a sentence and what is not, which helps you cope with grammatical errors made by the “writers” of your dataset.
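Stemming, for instance, can be sketched in a few lines. This example uses NLTK's Porter stemmer, which needs no extra downloads; the word list is invented, and punctuation is stripped with the standard library rather than NLTK. The stop word lists and sentence tokenizer require a one-time `nltk.download(...)` call and are not shown here.

```python
import string

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical tokens, including trailing punctuation to clean up.
raw_tokens = ["Running,", "runs", "connection.", "connected"]

# Strip punctuation from each token, then reduce it to its stem.
cleaned = [token.strip(string.punctuation) for token in raw_tokens]
stems = [stemmer.stem(token) for token in cleaned]

print(stems)
```

Different inflections of a word collapse to the same stem, so "Running" and "runs" both become "run".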
If you have ever tried doing data science without data, you probably realized that it is rather pointless. Luckily, the internet contains information about almost everything. But sometimes it’s not stored in a nice CSV-like format, and you first have to go out into the wild and gather it. This is exactly where Scrapy can help you, by making it easy to crawl websites around the globe using a few lines of code. Remember it the next time you have an idea for which no one has pre-gathered the dataset for you.
- Beautiful Soup
Beautiful Soup covers a very similar use case: often these damn web developers store their data in an inferior data structure called HTML. Beautiful Soup was created to make use of that nested craziness. It helps you extract various parts of the HTML, such as titles and tags, and lets you iterate over them and access their attributes much like normal dictionaries.
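Here is a minimal sketch of pulling a title and links out of HTML. The HTML snippet is made up for the example, and it uses Python's built-in "html.parser" so no additional parser library is assumed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML document, invented for illustration.
html = """
<html><head><title>Example page</title></head>
<body>
  <a href="/first" class="link">First</a>
  <a href="/second" class="link">Second</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# The page title is reachable as an attribute of the parsed tree.
title = soup.title.string

# Tag attributes are accessed with dictionary-style indexing.
hrefs = [a["href"] for a in soup.find_all("a")]

print(title, hrefs)
```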
Once your dataset size crosses a certain terabyte threshold, it can be hard to use the common vanilla implementations of machine learning algorithms. XGBoost is there to rescue you from waiting weeks for the computations to end. It is a highly scalable, distributed gradient boosting library that will make sure your calculations run as efficiently as possible.
Data engineering is part of every data science workflow, and if you have ever tried to process billions of data points, you know that your conventional for loop will only get you so far. PySpark is the Python API for the very popular Apache Spark data processing engine. It is like pandas, but built with distributed computing in mind from the very beginning. If you ever get the feeling that you can’t process your data fast enough to keep up, this is surely exactly what you need. The project has also started focusing on massively parallel machine learning for your very big data use cases.
urllib3 is a powerful, user-friendly HTTP client for Python. If you are trying to do anything with the internet in Python, this or something that builds on it is a must. That includes API crawlers and connections to various external data sources.