[Unpopular opinion] Don’t use poetry for Python dependencies

Even though Poetry has gained popularity for Python project management in recent years, and I have used it myself in a few projects, like any tool it may not be the perfect fit for every project or team. Here I have summarised a few arguments that could be made against using poetry:

  • Learning Curve: For teams or individuals already familiar with other Python packaging and dependency management tools like pip and virtualenv, or even pipenv, adopting poetry can introduce a learning curve. Understanding poetry’s way of managing dependencies, environments, and packages might require time, effort, and a few mistakes along the way.
  • Overhead for Simple Projects: Poetry provides a comprehensive solution that might be overkill for very simple projects. For small scripts or applications with minimal dependencies, the overhead of managing a poetry environment may not be justified. Do you have an API with 5-20 dependencies? Don’t use poetry. It doesn’t make any sense.
  • Performance Concerns: Poetry’s dependency resolution process can be slower than some alternatives, particularly for projects with a large number of dependencies. This could impact the speed of continuous integration builds or the responsiveness of development workflows. Personally, I had a situation in which adding a new package and rebuilding the lock file took me more than an hour.
  • Migration Effort for Existing Projects: Migrating an existing project to poetry from another system can require a non-trivial effort. This includes not only technical changes to how dependencies are managed and packaged but also updating any related documentation, developer guides, and CI/CD pipelines. I faced this challenge and it took a long time to migrate all the projects to poetry and to get it right. In the meantime, we also had to maintain two systems for the same projects.

While poetry may offer some advantages (and the Internet is full of pro-poetry arguments), weighing these potential drawbacks can help determine whether it’s the right choice for your situation.

MLOps – Linear Regression

Linear regression is a fundamental statistical technique used to model the relationship between a dependent variable and one or more independent variables. It assumes a linear relationship between the independent variables and the dependent variable, allowing us to make predictions or estimate the value of the dependent variable based on the values of the independent variables.

Here are the key components and concepts of linear regression:

Dependent Variable: Also known as the target variable or response variable, this is the variable we want to predict or explain using the independent variables.

Independent Variables: These are the predictor variables or features that are believed to have an influence on the dependent variable. In simple linear regression, there is only one independent variable, while multiple linear regression involves more than one.

Linear Relationship: Linear regression assumes that the relationship between the independent and dependent variables can be expressed as a straight line. This line can be described by the equation: y = mx + b, where y is the dependent variable, x is the independent variable, m is the slope of the line, and b is the y-intercept.

Ordinary Least Squares (OLS): The most common method used to estimate the parameters (slope and intercept) of the linear regression line is the OLS technique. It aims to minimize the sum of squared differences between the predicted and actual values of the dependent variable.

Assumptions: Linear regression relies on several assumptions, including linearity (the relationship is approximately linear), independence of errors (residuals), constant variance of errors (homoscedasticity), normality of errors, and absence of multicollinearity (when using multiple independent variables).

Residuals: Residuals represent the differences between the actual values of the dependent variable and the predicted values by the regression line. Analyzing the residuals can help assess the model’s goodness of fit and identify any patterns or violations of assumptions.

Coefficient of Determination (R-squared): R-squared measures the proportion of the variance in the dependent variable that can be explained by the independent variables. It ranges from 0 to 1, with higher values indicating a better fit of the model to the data.
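
To make the OLS, residuals, and R-squared ideas concrete, here is a minimal NumPy sketch on a small made-up dataset (the numbers are illustrative only, not from any real data):

import numpy as np

# Made-up dataset: x = years of experience, y = salary in thousands (illustrative)
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([35, 42, 47, 55, 58, 66], dtype=float)

# OLS estimates of the slope (m) and intercept (b) via the closed-form formulas
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - m * x.mean()

# Predictions and residuals (actual minus predicted)
y_hat = m * x + b
residuals = y - y_hat

# R-squared: the share of the variance in y explained by the model
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"slope={m:.2f}, intercept={b:.2f}, R^2={r_squared:.3f}")

In practice you would rarely compute this by hand: np.polyfit(x, y, 1) returns the same slope and intercept in one call, and libraries like statsmodels or scikit-learn report the R-squared for you.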

Interpretation of Coefficients: In linear regression, the coefficients (slope and intercept) provide insights into the relationship between the independent and dependent variables. The slope represents the change in the dependent variable for a one-unit change in the independent variable, while the intercept is the value of the dependent variable when the independent variable is zero.

Multiple Linear Regression: When there is more than one independent variable, multiple linear regression is used. The interpretation of coefficients becomes more nuanced, and additional considerations like multicollinearity arise.

Linear regression is widely used in various fields, including economics, social sciences, finance, and machine learning. It serves as a foundation for more advanced regression techniques and is often the first step in analyzing and modeling data.

An example of a real-world situation where linear regression can be applied: predicting housing prices.

Suppose you are working for a real estate agency, and your task is to predict housing prices based on various factors. In this scenario, linear regression can be used to model the relationship between the independent variables (such as the size of the house, number of bedrooms, location, etc.) and the dependent variable (the price of the house).

By collecting a dataset of historical housing sales that includes information on the independent variables (e.g., square footage, number of bedrooms, distance to amenities) and the corresponding sale prices, you can use linear regression to estimate the relationship between these variables and predict the prices of new houses.

After preprocessing and cleaning the data, you can perform linear regression analysis. The independent variables will serve as predictors, and the dependent variable will be the housing price. The regression model will estimate the slope and intercept of the line that best fits the data, allowing you to predict the price of a house based on its characteristics.

For example, the model may reveal that, on average, each additional bedroom adds a certain dollar amount to the house price, and each square foot of living space contributes a specific value. With this information, you can use the model to estimate the price of a new house by plugging in the corresponding values for the independent variables.
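
As a minimal sketch of this workflow, here is a scikit-learn example on a tiny made-up dataset with hypothetical features (square footage, number of bedrooms, distance to the city centre); the numbers are invented for illustration only:

import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up historical sales: [square_footage, bedrooms, distance_to_centre_km]
X = np.array([
    [1400, 3, 10.0],
    [1600, 3, 8.0],
    [1700, 4, 12.0],
    [1875, 4, 5.0],
    [1100, 2, 15.0],
    [1550, 3, 7.0],
    [2350, 5, 4.0],
    [2450, 4, 3.0],
])
# Corresponding sale prices, in thousands (illustrative)
y = np.array([245, 312, 279, 400, 199, 324, 505, 462])

model = LinearRegression()
model.fit(X, y)

# Coefficients: estimated price change for a one-unit change in each feature,
# holding the other features constant
print(dict(zip(["sqft", "bedrooms", "distance_km"], model.coef_)))
print("intercept:", model.intercept_)

# Predict the price of a new house: 2000 sqft, 3 bedrooms, 6 km from the centre
new_house = np.array([[2000, 3, 6.0]])
print("predicted price:", model.predict(new_house)[0])

On a real dataset you would also hold out a test set (or cross-validate) to check how well the model generalises before trusting its predictions.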

Linear regression in this context helps in understanding the impact of different factors on housing prices and enables the real estate agency to make informed decisions, such as pricing properties accurately, providing recommendations to buyers and sellers, and understanding the relative importance of different features in determining the house price.

It’s important to note that in real-world scenarios, multiple factors and more complex models are often employed to capture the intricacies of the housing market. However, linear regression provides a simple and intuitive starting point for understanding the relationship between variables and making predictions.

MLOps – What to know

There are several key areas you should focus on when learning about MLOps. Here’s a breakdown of the topics you should consider:

Source: https://ml-ops.org/content/mlops-principles

  • Machine Learning Algorithms: Gain a solid understanding of various machine learning algorithms, such as linear regression, logistic regression, decision trees, random forests, support vector machines, k-nearest neighbors, naive Bayes, clustering algorithms (k-means, hierarchical clustering), and dimensionality reduction techniques (principal component analysis, t-SNE). Learn how these algorithms work, their strengths, weaknesses, and when to apply them.
  • Large Language Models (LLM): Familiarize yourself with large language models, which are powerful models trained on vast amounts of text data. Some prominent examples include OpenAI’s GPT models (like GPT-3) or models like BERT, XLNet, and GPT-2. Understand how these models are pre-trained on massive corpora and can be fine-tuned for specific tasks such as natural language understanding, text generation, or sentiment analysis.
  • Statistical Modeling: Acquire a strong foundation in statistical modeling techniques. Learn about probability theory, statistical distributions (e.g., Gaussian, Poisson), hypothesis testing, confidence intervals, regression analysis (linear regression, logistic regression), time series analysis, and Bayesian statistics. Familiarize yourself with statistical software tools like R or Python’s statistical libraries (e.g., scipy, statsmodels).
  • Data Manipulation and Analysis: Develop skills in data manipulation and analysis. Learn how to clean and preprocess data, handle missing values, perform feature engineering, and work with structured and unstructured data. Gain proficiency in data analysis libraries such as pandas and data visualization libraries like Matplotlib or Seaborn.
  • Programming: Master a programming language commonly used in machine learning and data analysis, such as Python or R. Learn the fundamentals of the language, control structures, data types, functions, and libraries relevant to machine learning and statistical modeling (e.g., scikit-learn, TensorFlow, PyTorch).
  • Mathematics and Probability: Strengthen your knowledge of mathematical concepts relevant to machine learning, such as linear algebra, calculus, and probability theory. Understand matrix operations, differentiation, optimization algorithms (gradient descent), and probability distributions.
  • Experimental Design and Evaluation: Learn about experimental design principles and how to evaluate machine learning models. Gain knowledge of techniques for cross-validation, model selection, performance metrics (accuracy, precision, recall, F1-score), and overfitting/underfitting. A short scikit-learn sketch of some of these evaluation ideas follows this list.
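
As a small illustration of the evaluation topics above, here is a scikit-learn sketch that cross-validates a logistic regression classifier on the bundled breast cancer dataset and reports a few of the metrics mentioned (accuracy, precision, recall, F1-score):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scale the features and fit a simple, well-understood classifier
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5-fold cross-validation with several of the metrics mentioned above
scores = cross_validate(model, X, y, cv=5,
                        scoring=["accuracy", "precision", "recall", "f1"])

for metric in ["accuracy", "precision", "recall", "f1"]:
    values = scores[f"test_{metric}"]
    print(f"{metric}: {values.mean():.3f} (+/- {values.std():.3f})")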

Additionally, staying updated with the latest research papers, attending relevant workshops or conferences, and engaging in hands-on projects will help you deepen your understanding and practical skills in these areas.

Don’t repeat the logic in the unit tests

There are multiple mistakes you can make, as a software engineer, when you define and write the unit tests for your software.

One of the most common mistakes I have seen is repeating the logic from the tested function/method in the unit test itself:

### constants.py
HARD_UPPER = 10
SOFT_LOWER = 5
INTERMEDIATE_VALUES = [5, 6, 7, 8, 9, 10]

### functions.py
from constants import HARD_UPPER, SOFT_LOWER, INTERMEDIATE_VALUES

def my_func(x):
	if x in INTERMEDIATE_VALUES:
		return x + 1

	if x > HARD_UPPER:
		return x + x

	if x < SOFT_LOWER:
		return x ** x

Let’s imagine we want to test this dummy function, my_func, and that we do it this way:

from functions import my_func

def test_myfunc():
    test_values = [4, 7, 8, 10]
    for x in test_values:
        if x in [5, 6, 7, 8, 9, 10]:
            expected = x + 1
        if x > 10:
            expected = x + x
        if x < 5:
            expected = x ** x

        assert expected == my_func(x)

It looks nice. It tests the boundaries and it seems to cover all the intervals. But let’s imagine someone goes into constants.py and makes this change: increases SOFT_LOWER by 1 and removes 5 from INTERMEDIATE_VALUES.

SOFT_LOWER = 6
INTERMEDIATE_VALUES = [6, 7, 8, 9, 10]

If we run our tests, everything is green, but some results are no longer the expected ones. Take my_func(5): before, 5 was in INTERMEDIATE_VALUES and the result was 6. Now, 5 falls under the condition x < SOFT_LOWER (i.e. x < 6), so the result is 5 ** 5 = 3125.

Of course, the above is just a silly example where I copied the logic from the target function into the test. The easier fix would be to hardcode the boundaries and some intermediate values, like:

from functions import my_func

def test_myfunc():
    assert my_func(4) == 256
    assert my_func(5) == 6
    assert my_func(10) == 11
    assert my_func(11) == 22
    assert my_func(8) == 9

Now we can see the test failing for x=5:

>       assert my_func(5) == 6
E       assert 3125 == 6
E        +  where 3125 = my_func(5)

This is the case when the boundary values really matter and we want to be sure the developer is conscious of the change (they will see the tests failing). Such a case can be, for example, the tax rate applied (we don’t want to change the VAT value for a country too often, right?) or the maximum number of connected devices (e.g. Netflix).

If we can argue the values are not so sensitive, we could import the constants directly and use them in the test instead of hardcoding them (e.g. the time when the weekly report email is sent to the team’s PM, and someone changes it by mistake from 5 PM to 5 AM).
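
A minimal sketch of that alternative, reusing the toy example above and treating SOFT_LOWER as a non-sensitive value:

from constants import SOFT_LOWER
from functions import my_func

def test_myfunc_below_soft_lower():
    # SOFT_LOWER is treated as non-sensitive here: the boundary is imported
    # instead of hardcoded, so the test follows the constant if it is tuned
    # later. The trade-off is that such a change will no longer break the test.
    x = SOFT_LOWER - 1
    assert my_func(x) == x ** x

Note that the expected value still has to be derived from the input here; the point is only that the boundary itself is read from constants.py rather than duplicated in the test.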

Why are there so many bad strategies?

The main reason there are so many bad strategies is that we try to avoid the hard work necessary to create a good one.

The reason we avoid working on a better strategy is the difficulty of making decisions. A bad strategy is, in many situations, the consequence of leaders being unable to make a decision at the right moment.

More in Good Strategy Bad Strategy: The Difference and Why It Matters