Classifying UK property types with chDB and scikit-learn
Machine learning prototyping often involves juggling datasets, preprocessing steps, and performance constraints, which can make the process both complex and time-consuming. scikit-learn is one of Python's most popular and powerful libraries for machine learning, offering a large set of utilities for building and evaluating models. In this guide, we will explore how chDB complements scikit-learn in the model development life cycle, by building a binary classifier that predicts whether a property is a flat or a detached house, based on features like price, location, and date.
Prerequisites:
Before starting, ensure you have:
- Python 3.8 or higher installed
- A virtual environment activated (recommended)
- Basic understanding of SQL and machine learning concepts
- A ClickHouse Cloud service or trial account
What we'll be building
The UK property market features diverse property types, each with distinct characteristics. Two of the most common types are flats and detached houses. Flats are typically self-contained housing units within larger buildings, often found in urban areas with higher density. They generally cost less and have smaller floor areas. Detached houses are standalone properties with no shared walls, usually found in suburban or rural areas. They typically command higher prices and offer more privacy and space.
Our goal in this guide is to build a binary classifier that can predict is_flat = 1 (flat) or is_flat = 0 (detached house) from property transaction features like price, location, and date.
By focusing only on these two property types, we can build a model that learns to distinguish between these fundamentally different housing options.
We'll be using the UK Property Price dataset, which is available on ClickHouse Cloud as one of the starter datasets that you can import in a few clicks. It contains pricing data for properties sold in the United Kingdom from 1995 to 2024.
Setup
Add the dataset to your Cloud service
To add this dataset to an existing ClickHouse Cloud service, log in to console.clickhouse.cloud with your account details.
In the left hand menu, click on Data sources. Then click Add sample data:
Select Get started in the UK property price paid data (4GB) card:
Then click Import dataset:
ClickHouse will automatically create the pp_complete table in the default database and fill the table with 28.92 million rows of price point data.
Set up your credentials
In order to reduce the likelihood of exposing your credentials, we recommend you add your Cloud username and password as environment variables on your local machine. From a terminal, run the following command to add your username and password as environment variables:
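A minimal example, with placeholder values and variable names of our choosing:

```shell
# Replace the placeholder values with your ClickHouse Cloud credentials
export CLICKHOUSE_HOST="your-service.region.clickhouse.cloud"
export CLICKHOUSE_USER="default"
export CLICKHOUSE_PASSWORD="your-password"
```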
The environment variables above persist only as long as your terminal session. To set them permanently, add them to your shell configuration file.
Why environment variables? Storing credentials in environment variables rather than hardcoding them in your notebook:
- Prevents accidentally committing passwords to version control
- Makes it easy to switch between development and production environments
- Follows security best practices
Set up a Marimo notebook
Activate your Python virtual environment and install the following packages:
We will be using Marimo as a notebook to explore our data and develop our ML model. Run the following command to start a new notebook:
A new browser window should open with the Marimo interface on localhost:2718:
Marimo notebooks are stored as pure Python files, making them easy to version control and share with others.
In your first cell, import the necessary libraries:
In this guide we'll be making use of the following libraries:
- chdb: an embedded ClickHouse SQL engine for data processing
- pandas and numpy: familiar data-science tools for data manipulation and numerical operations
- scikit-learn: machine learning tools (models, evaluation metrics)
- matplotlib and seaborn: popular libraries for visualization
- marimo: reactive notebook components
- imbalanced-learn: tools for handling imbalanced datasets
Import data from Cloud
With our environment variables set, our data imported into our Cloud service, and our notebook up and running, we'll next load the data into our notebook for processing.
ClickHouse's remoteSecure function allows you to easily retrieve the data from ClickHouse Cloud.
You can instruct chDB to return this data in-process as a pandas DataFrame, which is a convenient and familiar way of working with data.
Run the following query in a new cell to fetch the UK price paid data from your ClickHouse Cloud service and turn it into a pandas.DataFrame:
The query above will load 11.8 million records from Cloud.
Make sure your Cloud service is not in an idle state when you run the cell above.
You can run df_raw.head() in a new cell to take a look at the first few rows of the data.
In the code snippet above we defined our SQL query as a templated string, passing in host, user, and password variables that hold our environment-variable values.
We used the remoteSecure() table engine to connect to ClickHouse Cloud over a secure connection (TLS encryption) and selected the columns we are interested in.
The WHERE clause filters data server side before transfer—removing invalid prices, null locations, and keeping only flats (type = 'F') and detached houses (type = 'D'). This server-side filtering is much faster than transferring everything and filtering locally.
We also created a target variable is_flat using CASE WHEN type = 'F' THEN 1 ELSE 0 END AS is_flat, which creates a binary label (1 for flat, 0 for detached house).
FORMAT Arrow tells chDB to return data in the Apache Arrow format.
We chose Apache Arrow as it is a columnar data format that provides 3-10x faster data transfer compared to CSV.
ClickHouse offers the ability to work with over a hundred different file formats including familiar ones like CSV, TSV and JSON.
The chdb.query(query, "ArrowTable") call tells chDB to execute the query we defined in the query string and to return a PyArrow Table object as the result.
We then use the to_pandas() function to convert the data to a pandas DataFrame, a convenient and familiar way of working with the data.
You'll see later in this guide how you can seamlessly transition between processing data using Pandas and using chdb depending on the task at hand.
Data exploration
Before building our model, let's use chDB to understand how flats differ from detached houses.
Compare price distributions
Run the code below in a new cell to generate price distribution statistics for flats vs detached houses.
| Property Type | Count | Average Price | Median Price | Q25 Price | Q75 Price | Min Price | Max Price |
|---|---|---|---|---|---|---|---|
| Flat | 5,204,810 | £204,500 | £144,720 | £85,250 | £232,500 | £1 | £160,000,000 |
| Detached | 6,666,708 | £296,000 | £234,950 | £145,000 | £360,000 | £1 | £53,490,000 |
In the query_stats string above we made use of the Python table engine.
This table engine allows us to read the data into chdb from the df_raw DataFrame which was defined in the previous cell.
We then used ClickHouse's built-in SQL capabilities and just a few of the many aggregate functions to aggregate and transform the data, calculating statistics like the average using AVG and the 25th percentile, median and 75th percentile using the quantiles function, as well as the min and max values using the MIN and MAX functions.
We used GROUP BY is_flat to split the data into two groups for comparison.
We're looking at these particular statistics because if the median price of a detached house is significantly different from the median price of a flat, it's likely that we'll be able to train our model easily. The 25th and 75th percentiles show us the "typical range" of the price distribution; if these don't overlap much between the property types, then price alone is a strong signal. The min and max values show us outliers in our data.
Visualizing the price difference between flats and detached houses
Run the following code in a new cell to visualize the distribution:
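One way to draw the two histograms with matplotlib (a tiny stand-in DataFrame keeps the snippet runnable on its own):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# df_raw is loaded in the earlier cell; the stand-in keeps this snippet self-contained
try:
    df_raw
except NameError:
    df_raw = pd.DataFrame({"price": [95_000, 150_000, 240_000, 420_000],
                           "is_flat": [1, 1, 0, 0]})

fig, ax = plt.subplots(figsize=(10, 5))
bins = np.logspace(4, 6.5, 60)  # log-spaced bins tame the long high-price tail
ax.hist(df_raw.loc[df_raw["is_flat"] == 1, "price"], bins=bins, alpha=0.5, label="Flats")
ax.hist(df_raw.loc[df_raw["is_flat"] == 0, "price"], bins=bins, alpha=0.5, label="Detached houses")
ax.set_xscale("log")
ax.set_xlabel("Price (£)")
ax.set_ylabel("Number of transactions")
ax.set_title("Price distribution: flats vs detached houses")
ax.legend()
fig
```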
The distributions reveal three key differences that indicate we could build an effective classifier:
- significant shift in central tendency between the two property types
- different distribution shapes and peak frequencies
- detached houses exhibit a much more substantial high-price tail
Details
First, there is a significant shift in central tendency between the two property types. Detached houses typically have a higher median price compared to flats, reflecting their larger size, privacy, and standalone nature. This difference in typical property price provides a simple yet powerful signal that classifiers can leverage.
Second, the distributions show different shapes and peak frequencies. Flats tend to be concentrated at the lower end of the price spectrum, with their peak occurring in the more affordable price ranges. In contrast, detached houses have a distribution that peaks at higher price points and extends further into premium price territory. This separation means that price alone can be a strong indicator of property type.
Lastly, detached houses exhibit a much more substantial high-price tail in their distribution. While both property types have outliers, detached houses maintain a noticeably thicker tail extending to much higher prices. Properties in these higher price brackets are more likely to be detached houses, which offer more space, land, and privacy—features that command premium prices in the market.
Geographic distribution
We'll again use chDB to explore how flats and detached houses are distributed across different localities.
In the query below we make use of the Python table engine to read our DataFrame, transform the data using ClickHouse, and return the result back as a DataFrame.
This query reveals how the composition of property types varies by locality. Urban areas like London tend to have a much higher percentage of flats, while more rural or suburban localities have a higher proportion of detached houses. This geographic variation is important because location becomes another useful signal for our classifier—the locality where a property is located can help predict whether it's more likely to be a flat or a detached house.
Feature engineering with chDB
Now we'll create features for our machine learning model. Feature engineering is where we transform raw data into inputs the model can learn from effectively.
Creating time-based features
The query above creates a feature dataset by extracting and transforming multiple variables from the raw property data.
Starting with the original columns like is_flat, price, and location fields, the query adds several time-based and derived features to help the classification model identify patterns.
Time-based features are extracted from the transaction date to capture temporal patterns in the property market.
- The year extraction is important because property prices fluctuate over time due to changing market conditions.
- The month feature captures seasonal patterns, as the property market sees more activity during spring and summer.
- The quarter identifies broader economic cycles, while the day of the week (1 for Monday, 7 for Sunday) reflects that most transactions occur on weekdays.
- The years_since_2010 variable provides a continuous time representation that machine learning models can more easily incorporate into their predictions.
The query also engineers a logarithmic transformation of the price variable, which offers several advantages for modeling. It reduces the influence of extreme outliers, makes the price distribution more normal, and treats proportional changes equally—so the jump from £100,000 to £200,000 is modeled similarly to £500,000 to £1,000,000. Finally, a unique record ID is assigned to each row, making it easy to trace predictions back to their original records.
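The "equal proportional changes" property is easy to verify directly: equal price ratios become equal differences on the log scale.

```python
import math

# A doubling at the low end and a doubling at the high end
doubling_low = math.log(200_000) - math.log(100_000)
doubling_high = math.log(1_000_000) - math.log(500_000)

# Both differences equal log(2) ≈ 0.693
assert math.isclose(doubling_low, doubling_high)
```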
One of chDB's key advantages is that all these transformations happen together in a single SQL statement, which executes much faster than performing each transformation sequentially in pandas. This efficiency is especially valuable when working with large datasets like ours.
Encoding categorical variables
Machine learning models need numerical inputs, but we have text values like town names ("LONDON", "BIRMINGHAM", "MANCHESTER"), district names, and locality names.
We'll need to convert these to numbers first.
While we could use a utility like LabelEncoder from scikit-learn, we'll use an approach called label encoding, implemented through reference tables (also called dimension tables in data warehousing).
Create reference tables
In a new cell, run the code shown below:
Let's break this query down step by step, starting with the inner select statement:
Here we use DISTINCT to get only unique column names (removing duplicates) and include WHERE ... IS NOT NULL to exclude missing values.
For column town, for example, this will return a list of unique towns such as "LONDON", "BIRMINGHAM", "MANCHESTER", etc.
The outer query then takes that data and assigns a sequential number to each row (to each unique town) using row_number() OVER (ORDER BY {column_name}), where row_number is a window function that numbers rows: 1, 2, 3, 4, etc.
We use ORDER BY {column_name} to order alphabetically.
Using the town column as an example, this will result in BIRMINGHAM=1, LEEDS=2, LONDON=3, MANCHESTER=4, etc.
Finally, we subtract one so that IDs start from 0, which is a common ML convention.
This produces a mapping table in which each unique categorical value is assigned a unique numeric ID. An example is shown below:
| town | town_id |
|---|---|
| BIRMINGHAM | 0 |
| BRISTOL | 1 |
| LEEDS | 2 |
| LIVERPOOL | 3 |
| LONDON | 4 |
| MANCHESTER | 5 |
Create reference tables for all categorical features
We can now use the create_reference_table function to generate reference tables for the features of interest.
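For example (the column list is our assumption; stand-ins are included so the snippet also runs outside the notebook, where the pandas fallback is equivalent to the chDB-based helper):

```python
import pandas as pd

# df_features and create_reference_table come from earlier cells;
# the stand-ins below keep this snippet self-contained.
try:
    df_features
except NameError:
    df_features = pd.DataFrame({"town": ["LONDON", "LEEDS"],
                                "district": ["CAMDEN", "LEEDS"],
                                "locality": ["LONDON", "LEEDS"]})

try:
    create_reference_table
except NameError:
    def create_reference_table(df, column_name):
        ref = pd.DataFrame({column_name: sorted(df[column_name].dropna().unique())})
        ref[f"{column_name}_id"] = range(len(ref))
        return ref

# Property type is excluded: it is already captured by the is_flat target
categorical_columns = ["town", "district", "locality"]
reference_tables = {
    col: create_reference_table(df_features, col)
    for col in categorical_columns
}
```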
Note that we don't encode the property type since it's already captured in our binary target variable is_flat.
This produces a unique numerical mapping for each of the categorical variables.
Apply the mappings
Now we replace text values with numeric IDs using these reference tables, with the zip function to pair up columns and the dict function to convert the pairs to a dictionary:
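For example (a stand-in reference table keeps the snippet runnable on its own):

```python
import pandas as pd

# reference_tables comes from the previous cell; the stand-in keeps this self-contained
try:
    reference_tables
except NameError:
    reference_tables = {"town": pd.DataFrame({"town": ["LEEDS", "LONDON"],
                                              "town_id": [0, 1]})}

# One lookup dict per column, e.g. {"LEEDS": 0, "LONDON": 1}
mappings = {
    col: dict(zip(ref[col], ref[f"{col}_id"]))
    for col, ref in reference_tables.items()
}
```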
We can then apply the mappings:
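For example, using pandas' map to translate each text column into its numeric ID (stand-ins keep the snippet runnable on its own):

```python
import pandas as pd

# df_features and mappings come from earlier cells; stand-ins keep this self-contained
try:
    df_features, mappings
except NameError:
    df_features = pd.DataFrame({"town": ["LONDON", "LEEDS"]})
    mappings = {"town": {"LEEDS": 0, "LONDON": 1}}

df_encoded = df_features.copy()
for col, mapping in mappings.items():
    df_encoded[f"{col}_id"] = df_encoded[col].map(mapping)
```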
The table below shows an example of how our features looked before applying the encoding:
| record_id | town | district | locality | price | is_flat |
|---|---|---|---|---|---|
| 1 | ABBOTS LANGLEY | ABERCONWY | ABBERTON | 450000 | 1 |
| 2 | ABERAERON | ADUR | AB KETTLEBY | 180000 | 1 |
| 3 | ABERDARE | ALLERDALE | ABBERD | 520000 | 0 |
| 4 | ABERDOVEY | ALNWICK | ABBERLEY | 320000 | 0 |
| 5 | ABERGAVENNY | ALYN AND DEESIDE | ABBERTON | 275000 | 1 |
After encoding, they look like this:
| record_id | town | town_id | district | district_id | locality | locality_id | price | is_flat |
|---|---|---|---|---|---|---|---|---|
| 1 | ABBOTS LANGLEY | 0 | ABERCONWY | 0 | ABBERTON | 4 | 450000 | 1 |
| 2 | ABERAERON | 1 | ADUR | 1 | AB KETTLEBY | 1 | 180000 | 1 |
| 3 | ABERDARE | 2 | ALLERDALE | 2 | ABBERD | 2 | 520000 | 0 |
| 4 | ABERDOVEY | 3 | ALNWICK | 3 | ABBERLEY | 3 | 320000 | 0 |
| 5 | ABERGAVENNY | 4 | ALYN AND DEESIDE | 4 | ABBERTON | 4 | 275000 | 1 |
Clean the data
Most ML models can't handle NaN (missing) values so we either need to fill them or remove them.
For this demonstration, we will remove them using the dropna function:
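For example (a stand-in DataFrame keeps the snippet runnable on its own):

```python
import pandas as pd

# df_encoded comes from the previous step; the stand-in keeps this self-contained
try:
    df_encoded
except NameError:
    df_encoded = pd.DataFrame({"price": [1.0, None], "town_id": [1, 2]})

before = len(df_encoded)
model_df = df_encoded.dropna()
print(f"Dropped {before - len(model_df)} rows containing missing values")
```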
Model training
Now that we have numerical features, it's time to build our classifier.
Select feature columns
For this classification task, we include location features (town_id, district_id, locality_id) because geographic location is a legitimate signal for predicting property type.
Urban areas tend to have more flats, while suburban and rural areas have more detached houses.
Unlike a feature that directly encodes the target (as a raw is_london flag would in a London-prediction task), these location features don't leak the answer; they provide useful context about the area's characteristics.
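Putting this together, the feature matrix and target might be assembled like so (the exact feature list is our assumption; a tiny stand-in dataset keeps the snippet self-contained):

```python
import numpy as np
import pandas as pd

# model_df comes from the cleaning step; the stand-in keeps this self-contained
try:
    model_df
except NameError:
    rng = np.random.default_rng(42)
    n = 10
    model_df = pd.DataFrame({
        "log_price": rng.normal(12, 1, n),
        "town_id": rng.integers(0, 5, n),
        "district_id": rng.integers(0, 5, n),
        "locality_id": rng.integers(0, 5, n),
        "year": rng.integers(1995, 2025, n),
        "month": rng.integers(1, 13, n),
        "quarter": rng.integers(1, 5, n),
        "day_of_week": rng.integers(1, 8, n),
        "years_since_2010": rng.integers(-15, 15, n),
        "is_flat": rng.integers(0, 2, n),
    })

feature_columns = [
    "log_price", "town_id", "district_id", "locality_id",
    "year", "month", "quarter", "day_of_week", "years_since_2010",
]
X = model_df[feature_columns]
y = model_df["is_flat"]
```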
Split the data
Machine learning models need to be tested on data they've never seen before. We split our data into two sets:
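A sketch of the split, using scikit-learn's train_test_split (stand-in data keeps the snippet runnable on its own):

```python
from sklearn.model_selection import train_test_split

# X and y come from the feature-selection step; the stand-in keeps this self-contained
try:
    X, y
except NameError:
    from sklearn.datasets import make_classification
    X, y = make_classification(n_samples=200, weights=[0.6], random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)
```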
In the code above we set test_size=0.20 to hold out 20% of the data for testing.
Our model will learn patterns from the remaining 80% of the data, the training set.
We set random_state=42 as the reproducibility seed.
With this set, we get the same split every time the code is run.
We set stratify=y to maintain class proportions.
If the overall data is 60% flats and 40% detached houses, then both the training and test sets will maintain the same 60/40 split.
We do this to prevent a situation in which the training or test sets are selected in such a way that they are no longer representative of the original data.
Correct for class imbalance
If there is a significant imbalance between flats and detached houses in our dataset (for example, if one type is much more common than the other), we might want to address this imbalance. An imbalanced dataset can cause the model to be biased toward the majority class.
The simplest and most direct way to handle class imbalance is to use a technique like SMOTE (Synthetic Minority Over-sampling Technique) on the training data to create synthetic examples of the minority class, providing the model with more balanced examples to learn from.
Run the following code in a new cell to use SMOTE:
What SMOTE does
- smote.fit_resample(X_train, y_train) calculates the feature-space relationships in the training data.
- It identifies the minority class (whichever property type is less common).
- It then creates new, synthetic property records for the minority class until both classes have equal representation, thus balancing the training set.
By training on balanced data, the Random Forest model will no longer have a bias toward the majority class, which should significantly improve the model's ability to correctly identify both flats and detached houses, leading to better overall performance metrics.
Training a random forest classifier
For our model, we'll use a Random Forest classifier, which is an ensemble of decision trees. Run the code below to train the model:
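A sketch of the training cell; the hyperparameters are reasonable defaults rather than tuned values, and the joblib filename is our choosing (stand-in data keeps the snippet runnable on its own):

```python
import joblib
from sklearn.ensemble import RandomForestClassifier

# Balanced training data comes from the SMOTE step; the stand-in keeps this self-contained
try:
    X_train_balanced, y_train_balanced
except NameError:
    from sklearn.datasets import make_classification
    X_train_balanced, y_train_balanced = make_classification(n_samples=200, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
model.fit(X_train_balanced, y_train_balanced)

# Save the model so the inference section can load it later (filename assumed)
joblib.dump(model, "property_classifier.joblib")
```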
This step can take some time. We recommend grabbing a warm beverage and reading something interesting on our blog while you wait. ⏱️☕
Model evaluation
Now let's test how well our model performs on data it has never seen:
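For example, computing the four metrics discussed below (a stand-in model and test set keep the snippet runnable on its own):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# model, X_test, y_test come from earlier cells; the stand-in keeps this self-contained
try:
    model, X_test, y_test
except NameError:
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    X_test, y_test = make_classification(n_samples=100, random_state=42)
    model = RandomForestClassifier(random_state=42).fit(X_test, y_test)

y_pred = model.predict(X_test)
print(f"Accuracy:  {accuracy_score(y_test, y_pred):.3f}")
print(f"Precision: {precision_score(y_test, y_pred):.3f}")
print(f"Recall:    {recall_score(y_test, y_pred):.3f}")
print(f"F1-score:  {f1_score(y_test, y_pred):.3f}")
```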
The model achieves strong performance with around 87% accuracy. Let's understand what each metric means:
- Accuracy tells us the overall percentage of correct predictions across both classes
- Precision measures how reliable our "flat" predictions are (few false positives means high precision)
- Recall measures how well we catch all actual flats (few false negatives means high recall)
- F1-Score balances precision and recall, giving us a single metric that considers both
Classification report
For a more detailed breakdown, we can generate a full classification report:
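For example (stand-in labels keep the snippet runnable on its own):

```python
from sklearn.metrics import classification_report

# y_test and y_pred come from the evaluation cell; the stand-in keeps this self-contained
try:
    y_test, y_pred
except NameError:
    y_test, y_pred = [0, 1, 1, 0, 1], [0, 1, 0, 0, 1]

print(classification_report(y_test, y_pred, target_names=["Detached", "Flat"]))
```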
This report shows precision, recall, and F1-score for both classes (flats and detached houses), giving us insights into how well the model performs for each property type.
Confusion matrix
A confusion matrix helps us visualize exactly where our model makes mistakes:
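One way to plot it, using seaborn's heatmap (stand-in labels keep the snippet runnable on its own):

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# y_test and y_pred come from the evaluation cell; the stand-in keeps this self-contained
try:
    y_test, y_pred
except NameError:
    y_test, y_pred = [0, 1, 1, 0, 1], [0, 1, 0, 0, 1]

cm = confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=["Detached", "Flat"], yticklabels=["Detached", "Flat"], ax=ax)
ax.set_xlabel("Predicted")
ax.set_ylabel("Actual")
fig
```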
The confusion matrix provides valuable insights:
- The darker cells on the diagonal show correct predictions
- Off-diagonal cells reveal where the model gets confused
- By examining which type of error is more common (false positives vs false negatives), we can understand the model's biases
Understanding model performance
The table below gives context to what different levels of accuracy mean for classification tasks:
| Accuracy Range | Interpretation |
|---|---|
| 50-60% | Barely better than guessing. The model didn't learn much. |
| 60-75% | There is some signal, but it is weak. |
| 75-85% | The model has good performance, and useful patterns were learned. |
| 85-95% | The model shows strong predictive patterns |
| 95%+ | The model is suspiciously performant. This might indicate data leakage or overfitting. |
For our property type classification task, we would expect between 85-95% accuracy because:
- Detached houses typically have higher prices than flats
- Geographic location correlates strongly with property type (urban areas have more flats)
- Clear separating patterns exist between the two property types
Our model's performance of ~87% accuracy with balanced precision and recall indicates that it has successfully learned to distinguish between flats and detached houses using the features we provided.
Inference with chDB
Now that we have a trained model, let's explore different approaches to make predictions on new data. This is where chDB really shines—showing multiple patterns for integrating ML with SQL workflows.
Loading the trained model
First, let's load our saved model that we trained earlier:
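Assuming the model was saved after training with joblib.dump (the filename here is our choosing):

```python
import os

import joblib

# Filename assumed; created earlier via joblib.dump(model, MODEL_PATH)
MODEL_PATH = "property_classifier.joblib"

if os.path.exists(MODEL_PATH):
    model = joblib.load(MODEL_PATH)
```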
Single property prediction
Let's create a helper function that can predict the property type for a single property:
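A sketch of such a helper (the function name is hypothetical; the features dict must contain the same columns, in the same order, that the model was trained on):

```python
import pandas as pd

def predict_property_type(model, features: dict) -> str:
    """Predict the property type for one property.

    Hypothetical helper: `features` must use the same column names,
    in the same order, as the training feature matrix.
    """
    row = pd.DataFrame([features])
    return "Flat" if model.predict(row)[0] == 1 else "Detached house"
```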
Batch predictions with chDB
For multiple properties, we can use chDB to handle feature engineering and then use our model for batch predictions:
Inference with ClickHouse User-Defined-functions (UDFs)
ClickHouse supports User Defined Functions (UDFs) that allow you to invoke machine learning models directly at insert or query time.
Implementation steps
Create a Python UDF Script
Create a Python script that reads data from stdin, loads your model, and outputs predictions to stdout.
Save this in ClickHouse's user_scripts directory (typically /var/lib/clickhouse/user_scripts/):
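A sketch of such a script; the script and model filenames, and the tab-separated feature order, are assumptions to adapt to your setup:

```python
#!/usr/bin/env python3
"""predict_is_flat.py: executable UDF sketch. Reads one tab-separated
feature row per line from stdin, writes the model's 0/1 prediction to stdout."""
import os
import sys

import joblib

# Path assumed; keep the trained model next to the script
MODEL_PATH = "/var/lib/clickhouse/user_scripts/property_classifier.joblib"

def predict_stream(model, lines, out):
    for line in lines:
        if not line.strip():
            continue
        features = [float(v) for v in line.rstrip("\n").split("\t")]
        out.write(f"{int(model.predict([features])[0])}\n")
        out.flush()  # ClickHouse reads predictions back incrementally

# The existence check lets the script be imported or tested without the model present
if __name__ == "__main__" and os.path.exists(MODEL_PATH):
    predict_stream(joblib.load(MODEL_PATH), sys.stdin, sys.stdout)
```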
Make sure that the file is executable
Setup steps:
1. Install the required Python dependencies for the ClickHouse user. Info: missing dependencies are the most common cause of CHILD_WAS_NOT_EXITED_NORMALLY and ModuleNotFoundError errors! Verify the installation afterwards.
2. Copy the model file to the ClickHouse user_scripts directory.
3. Make the script executable.
4. Set proper ownership.
5. Test the script manually to verify it works. It should output 0 or 1 (the prediction).
Common issues
If you get a CHILD_WAS_NOT_EXITED_NORMALLY error:
- Missing dependencies: make sure numpy and scikit-learn are installed for the user running ClickHouse.
- Model not found: verify that the model file exists and that the path is correct.
- Permission issues: check that the script is executable and owned by the clickhouse user.
- Check the ClickHouse logs for stderr output from the script.
Configure the UDF
Create an XML configuration file in /etc/clickhouse-server/ (with a name ending in _function.xml).
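A sketch of what the configuration might look like, assuming a script named predict_is_flat.py that takes nine Float64 arguments in the order the model expects (argument names and types here are assumptions); save it as, for example, /etc/clickhouse-server/predict_is_flat_function.xml:

```xml
<functions>
    <function>
        <type>executable</type>
        <name>predict_is_flat</name>
        <return_type>UInt8</return_type>
        <argument><type>Float64</type><name>log_price</name></argument>
        <argument><type>Float64</type><name>town_id</name></argument>
        <argument><type>Float64</type><name>district_id</name></argument>
        <argument><type>Float64</type><name>locality_id</name></argument>
        <argument><type>Float64</type><name>year</name></argument>
        <argument><type>Float64</type><name>month</name></argument>
        <argument><type>Float64</type><name>quarter</name></argument>
        <argument><type>Float64</type><name>day_of_week</name></argument>
        <argument><type>Float64</type><name>years_since_2010</name></argument>
        <format>TabSeparated</format>
        <command>predict_is_flat.py</command>
    </function>
</functions>
```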
After creating the XML file:
- Restart ClickHouse to load the new function:
- Verify the function is registered:
You should see:
Use the UDF in SQL Queries
With the setup complete, we'll take one of the existing rows and see how our newly defined predict_is_flat UDF performs.
We'll take the following row from our existing dataset
Where our encoding for NEW QUAY (town) is 710, CEREDIGION is 74 and NEW QUAY (locality) is 13899.
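For example (the price and date-derived values are illustrative, and the argument order must follow whatever your UDF configuration declares):

```sql
SELECT predict_is_flat(
    log(250000),  -- log_price (price is illustrative)
    710,          -- town_id for NEW QUAY
    74,           -- district_id for CEREDIGION
    13899,        -- locality_id for NEW QUAY
    2024,         -- year
    3,            -- month
    1,            -- quarter
    5,            -- day_of_week
    14            -- years_since_2010
) AS is_flat_prediction
```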
The function returns the prediction:
Inference with ClickHouse User-Defined-functions (UDFs) in Cloud (coming soon)
UDFs in Cloud are currently in private preview. This guide will be updated when executable UDFs in ClickHouse Cloud become generally available.