Python Packages 101 – Part 1
Overview
You have decided to learn Python and you have even picked up all the basics such as while and for loops, if statements, and using lists and dictionaries. Now you want to get serious but have no idea what to learn next. All your Python expert friends keep throwing around odd names such as pandas, Beautiful Soup, seaborn and spaCy, and you are not sure where to start.
If that all sounds too familiar, fret not! This article will guide you through the must-know items to set up and use Python effectively:
- A quick overview of the top packages used by data scientists and financial professionals in Python organized in practical categories
- A cheat sheet of all the popular packages along with the conda commands needed to install them with Anaconda and links to more documentation
This article profiles the top packages that bring core functionality to Python and allow you to perform key data tasks efficiently. Our upcoming second article in this two-part series will discuss:
- An overview of packages for additional categories such as statistical analysis and machine learning
- An overview of Anaconda and a guide on how to find, install and update packages with Anaconda
Popular Packages
Python has been around for about 30 years. However, what has really made it popular over the last decade has been the introduction of packages such as pandas (released in 2008, used for data manipulation) and Beautiful Soup (released in 2004, used for web scraping). These packages are created by other developers in the community and submitted to package repositories such as PyPI (the Python Package Index) and the Anaconda Repo. As of this article’s date (Mar 2021), there are more than 290,000 packages on PyPI, which can be very overwhelming for a programmer new to Python.
We have picked 25 of the top packages used by business professionals that are also taught in our various Python classes at Marquee. All these packages are free to download and use with Python and can be easily installed with the Anaconda software. We have broken these top packages into the following categories:
- Data manipulation: NumPy; Pandas
- Web scraping: Beautiful Soup; Selenium; Requests; Urllib3
- Visualization: Matplotlib; Seaborn; Bokeh; Plotly
In our follow-up article we will also cover packages in the following more advanced categories:
- Dashboarding: Dash; Streamlit
- Statistical analysis: Statsmodels; SciPy
- File management: OS; Pathlib; Shutil; Pillow; Camelot; Tabula
- Machine learning: Scikit-learn; NLTK; spaCy; OpenCV; PyTesseract
Data Manipulation Packages
Data manipulation is one of the most common uses of Python by business and finance professionals. Some data sets are too large to work with in Excel, and Python allows for very quick and automated manipulations such as cleaning up data sets, sorting, filtering, lookups and quick statistical analysis. However, storing complex data in Python’s built-in list and dictionary structures can be inefficient or tedious to work with. That is where additional packages such as NumPy and pandas come in to help with more advanced data manipulation and analysis.
NumPy
- Website: https://numpy.org/
- Documentation: https://numpy.org/doc/stable/
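As a quick taste of what NumPy offers, the sketch below computes daily returns from a series of prices without any explicit loop. The price figures are made up purely for illustration:

```python
import numpy as np

# Hypothetical daily closing prices for a stock
prices = np.array([100.0, 102.0, 101.0, 105.0, 107.0])

# Vectorized math: each element of prices[1:] is divided by the
# previous day's price in one operation, with no for loop needed
returns = prices[1:] / prices[:-1] - 1

print(returns.round(4))         # per-day percentage changes
print(returns.mean().round(4))  # average daily return
```

This vectorized style is both faster and more concise than looping over a plain Python list, which is the main reason NumPy underpins almost every other data package in this article.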
Pandas
- Website: https://pandas.pydata.org/
- Documentation: https://pandas.pydata.org/docs/
- Cheat sheet: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
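The sketch below shows the kind of Excel-style manipulation pandas makes easy: adding a computed column, filtering rows, and aggregating by group. The trade data is invented for illustration:

```python
import pandas as pd

# A small, made-up data set of trades
trades = pd.DataFrame({
    "ticker": ["AAPL", "MSFT", "AAPL", "GOOG"],
    "shares": [100, 50, 25, 10],
    "price":  [150.0, 250.0, 155.0, 2800.0],
})

# Add a computed column, then filter and aggregate
trades["value"] = trades["shares"] * trades["price"]
large = trades[trades["value"] > 10000]          # filter, like an Excel AutoFilter
by_ticker = trades.groupby("ticker")["value"].sum()  # aggregate, like a pivot table

print(large)
print(by_ticker)
```

A DataFrame behaves much like a spreadsheet table, but every step above is repeatable code, so rerunning the analysis on next month's data is a matter of swapping in a new file.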
Web Scraping Packages
Web “scraping” is a generic term that means aggregating or extracting data from websites. Scraping can be done by opening a browser from within a programming language, navigating to a specific website and then downloading the data from that web page either directly into the programming language or as separate files. Sometimes, a browser doesn’t even need to be opened and the programming language code can access the data directly from the server. While sometimes simple to do, web scraping can be one of the hardest things to code in a programming language, due to the complexity and variability in how information is stored on websites. For more complex websites, prior knowledge of web design (HTML, CSS, JavaScript) is helpful; however, with a bit of trial and error, and by using the web scraping packages available in Python, one can quickly get the data downloaded in the proper format.
If the data is displayed in a tabular format on the website, it can easily be scraped with the Pandas package mentioned in the previous section. Otherwise, data can be extracted using a combination of the Beautiful Soup package and a package to connect to the website (e.g. requests or urllib3). Selenium is also used if there is a need for interaction with the site (e.g. logging in, clicking on a button, or filling out a form).
Business and finance professionals use web scraping with Python to perform more extensive due diligence on their clients, competitors, or potential investments, such as analyzing store locations or grabbing pricing and inventory information on products.
Beautiful Soup
- Website: https://www.crummy.com/software/BeautifulSoup/
- Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
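The sketch below shows Beautiful Soup's core job: parsing HTML and pulling out the elements you care about. In real scraping the HTML would come from requests or urllib3; here a hard-coded snippet stands in for a downloaded page:

```python
from bs4 import BeautifulSoup

# A small, hard-coded HTML snippet standing in for a downloaded page
html = """
<html><body>
  <h1>Store Locations</h1>
  <ul class="stores">
    <li>New York</li>
    <li>Toronto</li>
    <li>London</li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# CSS selectors find the list items inside the <ul class="stores"> element
cities = [li.get_text() for li in soup.select("ul.stores li")]
print(cities)  # ['New York', 'Toronto', 'London']
```

The same `select` / `get_text` pattern scales up to real pages; the hard part of scraping is usually figuring out which tags and classes hold the data, which the browser's "inspect element" tool helps with.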
Selenium
- Website: https://www.selenium.dev/
- Documentation: https://selenium-python.readthedocs.io/
Urllib3
- Website: https://urllib3.readthedocs.io/en/latest/
- Documentation: https://urllib3.readthedocs.io/en/latest/user-guide.html
Requests
- Website and documentation: https://requests.readthedocs.io/en/master/
- Another HTTP package that is used in combination with Beautiful Soup to connect to websites
- Alternative to urllib3; requests package actually uses urllib3 “under the hood” and has streamlined many of the functions that exist in urllib3 to make it simpler to connect to websites and retrieve data
Visualization Packages
When it comes to visualizing data sets, business professionals tend toward two extremes when creating charts. On the “quick and dirty” end of the spectrum, data analysts create their graphs in Excel. The simple user interface allows for quick creation of charts and customization of every chart element with just a double click – from the title of the graph to the font and size of the axis labels. However, charts in Excel are usually not interactive and can be tedious to update for new data sets or different configurations of the data. On the “premium” end of the spectrum, data analysts use more advanced dashboarding software such as Tableau and Microsoft’s Power BI. These programs are great for creating stunning, interactive visualizations; however, they also have a steeper learning curve and usually a cost associated with more advanced features.
Python visualization packages are a great compromise between these two alternatives. The coding nature of Python allows for the creation and automation of multiple charts in a matter of seconds. Also, with small additions of code to the visualization functions, anything from the data labels to the color of the markers can be customized. There are also several packages, such as Bokeh and Plotly, that allow interactive charts to be exported as stand-alone HTML files that can be shared with colleagues at work. One last important point to mention – all these packages are free to install and use.
matplotlib
- Website: https://matplotlib.org/
- Examples Gallery: https://matplotlib.org/stable/gallery/index.html
- Documentation: https://matplotlib.org/stable/contents.html
- 2D and 3D visualization package
- Makes graphs similar to those in MATLAB
- Useful for time series and cross-sectional data
- Works very well with the pandas package – columns from DataFrames can be used as the source data for x and y coordinates on scatter plots, line charts and bar charts
- Charts are highly customizable with custom functions for titles, legends, and annotations
- However, some of the functions are not as intuitive as in other Python visualization packages
- Charts can be exported as jpg or png files or embedded as outputs in Jupyter Notebook files (a web-based coding environment for writing and running Python code)
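The points above can be sketched in a few lines. This minimal example uses made-up monthly revenue figures and the non-interactive Agg backend so it runs without a display window:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; renders straight to file
import matplotlib.pyplot as plt
import pandas as pd

# Made-up monthly revenue figures
df = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar", "Apr"],
    "revenue": [120, 135, 128, 150],
})

# DataFrame columns feed directly into the plotting call
fig, ax = plt.subplots()
ax.plot(df["month"], df["revenue"], marker="o")
ax.set_title("Monthly Revenue")
ax.set_xlabel("Month")
ax.set_ylabel("Revenue ($k)")
fig.savefig("revenue.png")  # export the chart as a png file
```

Because the chart is defined in code, rerunning the script on an updated DataFrame regenerates the figure instantly, which is exactly the automation advantage over Excel described above.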
seaborn
- Website: https://seaborn.pydata.org/
- Examples Gallery: https://seaborn.pydata.org/examples/index.html
- Documentation: https://seaborn.pydata.org/tutorial.html
- Data visualization package based on matplotlib
- It provides a higher-level interface with more streamlined and easier-to-use functions than matplotlib
- However, it offers fewer customization options and instead relies on built-in “styles”
Plotly
- Website and Examples Gallery: https://plotly.com/python/
- Interactive plotting library that supports more than 40 charts
- The Plotly Express module is part of the Plotly package and contains functions that allow entire interactive charts to be created in one or a few lines of code
- Plotly Express documentation: https://plotly.com/python-api-reference/plotly.express.html
- Plotly also allows for creation of standalone webpages that can be shared with and viewed by teammates who do not have Python installed on their devices
- The charts remain fully interactive and are not just static pictures, as is the case with matplotlib
- Example of interactivity: zooming in on sections of the graph, filtering out categories plotted on the graph by clicking on legend labels, more detailed information being displayed on the graph while mouse is hovering over data points, drilling down on categories in sunburst charts or tree maps
Bokeh
- Website: https://docs.bokeh.org/en/latest/index.html
- Examples Gallery: https://docs.bokeh.org/en/latest/docs/gallery.html
- Another interactive visualization library that creates charts in standalone webpages
- Contains several modules that allow for customization of different components of the graphs and the layout of the output webpage
- E.g. the tools that show up in the side toolbar can be customized (pan, zoom, reset, etc.) and extra widgets can be added to charts such as dropdowns, buttons, etc.
- The code is more complex than with Plotly Express, but more customization is possible
Cheat Sheet and Next Article
Below is a link to a cheat sheet summarizing all the packages discussed in this article and the ones that will be covered in Part 2 of this series. The cheat sheet provides a summary table with all the packages, their categories, the conda commands to install them with Anaconda, and links to the documentation and the Anaconda Repo.
Python Packages – Cheat Sheet.pdf
In our next article we will discuss packages used to automate file management (creating, deleting, moving files and folders and importing images and pdf files), run more advanced statistical analysis in business and finance applications (time series analysis, linear regressions and optimization problems), create advanced interactive dashboards for deeper analysis and insights into data sets, and take advantage of advanced machine learning algorithms (NLP – natural language processing, OCR – optical character recognition, advanced forecasting and prediction models).
About Marquee
The Marquee Group can provide comprehensive Python training to your team. To learn more, simply email us at info@themarqueegroup.com, or check out our self-study courses including Python Fundamentals, Applied Machine Learning, or our Python Bundle.