
Azure Synapse Analytics is Microsoft's one-stop shop that integrates big data analytics and data warehousing. It features Apache Spark pools, which provide scalable and cost-effective processing power for Spark workloads and let you process and analyse large volumes of data quickly and easily.
I have been working with the Spark environment on Azure Synapse for a while and thought of documenting my experience with installing external Python libraries for Spark pools on Azure. This guideline may come in handy if you are running big data analytics experiments that need specific Python libraries.
Apache Spark pools on Azure Synapse workspaces come with the Anaconda Python distribution and pip as the package manager. Most of the common Python libraries used in the data analytics space are already installed. If you need additional packages installed and available within your scripts, there are three ways to do it on Synapse.
- Use magic commands in notebooks to install packages at the session level.
- Upload the Python package as a workspace package and install it on the Spark pool.
- Install packages using a pip requirements or Conda environment input file.
01. Use magic commands in notebooks to install packages at the session level.

This is the simplest and most straightforward way to install a Python package at the Spark session level. You just use the magic command followed by the usual pythonic way of installing the package through pip or conda, as in the sketch below. Though this is easy for prototyping and quick experiments, the package has to be reinstalled every time you start a new Spark session, so it is better to avoid this method in production environments. It is good for rapid prototyping experiments.
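As a minimal sketch (assuming a Synapse notebook cell; sentence-transformers is the same example package used later in this post), a session-scoped install would look something like this:

%pip install sentence-transformers==2.2.2   # installed for the current session only

# Verify the session-scoped install.
import sentence_transformers
print(sentence_transformers.__version__)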
02. Upload the Python package as a workspace package and install it on the Spark pool.

An Azure Synapse workspace lets you upload workspace packages and install them on Spark pools. It accepts Python wheels (.whl files), jar files, or tar.gz archives as packages.
After uploading the packages, go to the specific Apache Spark pool and select the packages you want to install on it. The initial installation may take a few minutes. (In my case, it took around 20 minutes to install 3 packages.)
Based on my experience with different Python packages, I would stick with Python wheels from PyPI or jars from official package distributions. I tried the sentence-transformers tar.gz file from PyPI (https://pypi.org/project/sentence-transformers/). It gave me an error during installation mentioning a package dependency on R (which is confusing).
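Once the pool-level installation finishes, every new session attached to that pool can use the packages without any extra install step. As a quick sanity check from a notebook (using the two example packages from this post; importlib.metadata is in the standard library from Python 3.8), you could run:

from importlib.metadata import version

# Print the installed versions of the example packages.
print(version("sentence-transformers"))
print(version("spark-nlp"))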
03. Install packages using a pip requirements or Conda environment input file.

If you are familiar with building Conda environments or Docker configurations, keeping a package list as a config file should not be new to you. You can specify either a .txt file (pip requirements format) or a .yml file (Conda environment format) with the desired package versions to be installed on the Spark pool. If you want to pull libraries from a specific channel, you should use a .yml file.
For example, if you need to install the sentence-transformers and spark-nlp packages (both used for NLP on Spark), add these two lines to a .txt file and upload it as the input file for the Spark pool:
sentence-transformers==2.2.2
spark-nlp==4.4.1
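If you need packages from a specific Conda channel instead, a .yml environment file along these lines can be used. This is only a rough sketch (the environment name and channel are illustrative); check the Azure documentation for the exact restrictions Synapse places on this file:

name: example-env
channels:
  - conda-forge
dependencies:
  - pip
  - pip:
      - sentence-transformers==2.2.2
      - spark-nlp==4.4.1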
I find this option to be the most robust way of installing Python packages on a Spark pool, since it gives you the flexibility to select the desired package version and the specific channel to use during installation. It also saves you from reinstalling the packages each time the session gets refreshed.
Let me know about your experience and the lessons you learned while experimenting with Apache Spark pools on Azure Synapse Analytics.
Here’s the Azure documentation on this for further reference.