diff --git a/_posts/2023-08-05-All-Roads-Lead-To-Kdb:-Technical-Counterpart.html b/_posts/2023-08-05-All-Roads-Lead-To-Kdb:-Technical-Counterpart.html new file mode 100644 index 0000000..3c28d7b --- /dev/null +++ b/_posts/2023-08-05-All-Roads-Lead-To-Kdb:-Technical-Counterpart.html @@ -0,0 +1,3304 @@ +--- +layout: post +date: 2023-09-15 10:00:00.000000000 +02:00 +type: post +parent_id: '0' +published: true +password: '' +status: publish +categories: [] +tags: [] +image: /img/relojes-blandos.jpg +meta: + _wpcom_is_markdown: '1' + _edit_last: '35663564' + geo_public: '0' + _publicize_job_id: '9647437577' +author: Oscar Nydza Nicpoñ, Marcos Vázquez Martín +permalink: "/2023/09/15/all-roads-lead-to-kdb-the-technical-counterpart" +--- + + + + + + + + + + + + +
+

All Roads Lead + to Kdb: the technical counterpart

+

This post serves as a follow-up to our prior article, All + Roads Lead to PyKX, where we introduced the utility of the PyKX + library from the perspective of Emma Monad, the CTO of a large fictional + company known as Mad Flow. In addition to this high-level perspective, + the focus of this post is on Python programmers who are eager to + explore the technical intricacies that were left unexplained in the + aforementioned article. Consequently, the post will provide them with a + highly detailed, step-by-step example of how to migrate an existing + Python codebase into PyKX. Similarly to the outcomes presented in the + post Accelerating + Python Workflows using PyKX, which we highly recommend reading, we + will observe a significant performance advantage of the resulting PyKX + code compared to the initial pandas implementation. More references are + available in the bibliography at the end of the post.

+

The structure of the post will be as follows:

+ +

The initial section regarding the use case is independent of + programming languages and is primarily included for reference purposes. + If you're eager to delve directly into the code and begin learning how + to migrate pure pandas-based Python code into PyKX, you can proceed to + the second section now and revisit the first section as necessary.

+

Use Case

+

With the aim of predicting traffic congestion in the presence of + rain, Mad Flow dedicated significant effort to preparing and integrating + weather and traffic data into an LSTM model. This endeavor aligned with + other studies + that sought to forecast traffic patterns using LSTM models based on air + pollution. Just like findings from studies in cities such as Manchester, + Shenzhen and + Belgrade, + our results project a weekday traffic volume increase of 5-15% during + peak hours in the presence of rain. +

+

As in most smart city projects, the data used is notably diverse, so it required a good deal of preparatory work. The following sections detail the data sources, the cleansing and interpretation processes, and the model used.

+

Data sources

+

The Madrid City + Council provides weather and traffic data, including both real-time + and historical records. For the purpose of model training, only the + historical data is pertinent. These datasets, formatted as CSV files, + are categorized by months spanning from the year 2018 to the + present.

+
+ 🔍 You can access the different datasets required to run this notebook from the following links: + + +
+

Traffic data

+

The traffic sensors are strategically positioned at traffic lights throughout the city of Madrid. The collected data gathers diverse measurements of road conditions, including speed and traffic volume. These metrics are summarised in the load metric, quantified as a percentage to denote the level of congestion. The format of the traffic table is as follows:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
traffic_stationelement typeintensityocupationloadvmederrorintegration perioddate
01001M3032409059N52022-12-01 00:00:00
11001M3032409059N52022-12-01 00:15:00
21001M303001066N52022-12-01 00:30:00
+

This table brings together information from various traffic stations, detailing elements such as traffic intensity, occupancy, congestion load, and other relevant data, all correlated with specific dates and times.

+

The table also includes a column for the date and another column that + identifies the sensor. This identification will be used to establish a + link with its corresponding coordinates by utilizing the subsequent traffic + stations table:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
element typedistrictidcodenameutm_xutm_yLongitudeLatitude
0"URB"43840"01001""Jose Ortega y Gasset E-O - Pº + Castellana-Serrano"4416154.47577e+06-3.6883240.4305
1"URB"43841"01002""Jose Ortega y Gasset O-E - Serrano-Pº + Castellana"4417064.47577e+06-3.6872640.4305
2"URB"13842"01003""Pº Recoletos N-S - Almirante-Prim"4413194.47484e+06-3.6917340.4221
+

Weather data

+

The table below provides a snapshot of the type of information + available in the weather + dataset:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
weather_stationmagnitudeyearmonthdayH01V01...
01088120221222270N...
110882202212229N...
2108832022122294.7N...
310886202212221031N...
410887202212222N...
+

This dataset contains hourly meteorological observations, including + temperature, humidity, wind speed, and precipitation measurements.

+

Each row documents various details about a specific meteorological + station:

+ +

Furthermore, we will require the geographical coordinates of the + various stations, which are available in a distinct table + provided by the Madrid City Council.

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
weather_stationLongitudeLatitude
04-3.7122640.4239
18-3.6823240.4216
216-3.6392440.44
318-3.7318440.3948
4108-3.71788140.47663
+

Data Cleansing + +

+

After loading the data, the next step was to get it ready for + analysis and modeling. Given the diverse nature of the datasets, the + process of data cleansing will be conducted individually for each + dataset. We'll bring all this cleaned data together into a unified table + in a later phase.

+

Traffic Data

+

The process employed to clean the traffic dataset involves the + following steps:

+
    +
  1. +

    Initially, values characterized by measurement errors are + excluded from consideration. The documentation designates these values + using the symbol "N".

    +
  2. +
  3. +

    Subsequently, solely the load measurements, which constitute the + focus of our analysis, are retained.

    +
  4. +
  5. +

    The data is then grouped based on each date and station, + subsequently computing the average load value for each group:

    +
  6. +
+ + + + + + + + + + + + + + + + + + + + + + + + + +
datetraffic_stationtraffic_load
2022-12-01 00:00:0010010
2022-12-01 00:00:0010020
2022-12-01 00:00:0010030
+

Weather Data

+

Concerning the weather dataset, the procedures undertaken were as + follows:

+
    +
  1. +

    In the weather dataset, the date is split into separate columns for day, month, and year. However, in the traffic dataset, this information is combined. So, we merged the columns to create a single 'date' column with the format: day-month-year.

    +
  2. +
  3. +

    Additionally, given that each individual row contains all 24 + daily measurements taken at a weather station, along with their + respective magnitudes, we need a transformation where each hourly + measurement value and its corresponding validity status are discrete + entries within separate rows. This presents an opportunity to remove any + measurements that are marked as invalid.

    +
  4. +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
hourvalue
00 days 00:00:007.1
10 days 00:00:006.8
20 days 00:00:002.8
30 days 00:00:005.6
+
    +
  1. +

    To standardize the time measurement format and align it with the traffic table, we merge the date with the time. As a result, the updated date format becomes: day-month-year hour:minute:second

    +
  2. +
  3. +

    Lastly, we restructure the diverse types of measurements into + distinct columns, enhancing the organizational coherence of the + dataset:

    +
  4. +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
dateweather_stationdirectionhumiditypressurerainfallsolartemperaturewind
02022-12-01 00:00:004000007.10
12022-12-01 00:00:0080670009.40
22022-12-01 00:00:00160730008.90
+

Location Data

+

To make it easier to combine the tables we mentioned, we need to connect the traffic sensors with the weather stations. To do this, we can use a distance matrix that helps us find the closest pairs of both types of stations. Instead of working with raw coordinate differences, we decided to go with the Haversine distance. This method calculates the distance in meters between coordinate pairs, which gives a more meaningful picture of how far apart the stations actually are.

+
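For reference, the distance between two coordinate pairs can be sketched in a few lines of Python. This is only a minimal illustration of the formula, not the code used in the project (the project relies on the haversine library introduced later):

from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    # great-circle distance between two (latitude, longitude) points, in kilometres
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))  # 6371 km approximates the Earth's radius

# e.g. two of the traffic stations listed above
print(haversine_km(40.4305, -3.68832, 40.4221, -3.69173))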

This visualization is best shown using a heatmap, where the distances + are displayed on a range from 0 to 20 kilometers:

+
+ +
Figure 1: Heatmap of distances in km between weather and traffic + stations
+
+

If we look at the shortest distances between each type of station, we + end up with a mapping between these two tables:

+ + + + + + + + + + + + + + + + + + + + + + + + + +
traffic_stationweather_station
03840109
13841109
238428
+

The Final Table + +

+

After we've done all the setup for the three tables - weather, + traffic, and distance - we can now join them together. Because the + weather table has data every hour, while the traffic data is available + every 15 minutes, we'll combine them using an 'asof' join method. Then, + we'll add time and day-of-the-week details to the dataset, which will + help us study how these factors are connected to traffic congestion.

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
datetraffic_stationloadDistanceClosestweather_stationdirectionhumiditypressurerainfallsolartemperaturewindweekdayhour
02022-12-01 00:00:001001015180670009.4030
12022-12-01 00:00:001002015180670009.4030
22022-12-01 00:00:001003014180670009.4030
32022-12-01 00:00:001006015180670009.4030
42022-12-01 00:00:00100901404000007.1030
+

Data interpretation + +

+

Traffic patterns exhibit a pronounced dependency on time. To dig + deeper into the data, a filtering process will be applied to extract + instances of peak traffic. This focused dataset will help us really get + a better grip on traffic dynamics.

+

The next figures illustrate the pronounced seasonality within the dataset: + + + + +
+
+ +
Figure 2: Load per Hour +
+
+
+
+ +
Figure 3: Load per + Weekday
+
+
+

+

It's clear that weekdays have more traffic compared to holidays. + Likewise, during the daytime, there's a higher traffic flow than at + nighttime.

+

Concerning the relationship between Rainfall and Load, preliminary + observations indicate a limited occurrence of rainy days within our + dataset:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
rainfall
count841068
mean0.0130932
std0.194232
min0
50%0
90%0
99.9%2.9
max10.9
+

After looking closely at the percentiles of the precipitation column, + it's clear that recorded instances of rainfall are quite scarce. To + address this limitation, the measurements were divided into distinct + categories based on the intensity of the rainfall. This led to a + separate analysis for datasets involving heavy rain, moderate rain, and + times when there was no rain. This analysis was done hourly, to minimize + the impact of time on traffic patterns.

+
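As a rough sketch of that categorisation step, the hourly comparison can be expressed in pandas along the following lines. The column names and the intensity thresholds here are illustrative assumptions, not the exact ones used in the project:

import pandas as pd

# bin rainfall intensity into illustrative categories (thresholds are assumptions)
complete["rain_level"] = pd.cut(complete["rainfall"],
                                bins=[-0.001, 0.0, 2.0, float("inf")],
                                labels=["no rain", "moderate rain", "heavy rain"])

# average load per hour within each category, limiting the influence of time of day
load_by_hour = complete.groupby(["hour", "rain_level"])["load"].mean().unstack()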

The results of this analysis show that on average, traffic congestion + increases by around 5% to 14% during rainy hours. This consistently + matches the findings discussed in the Use Case + section.

+
+ +
Figure 4: The average traffic load per hour for measurements categorized as heavy rain (blue), moderate rain (brown), and no rain (green).
+
+

To confirm the statistical significance of the differences between these groups, we conducted an analysis of variance (ANOVA) test. Notably, the ANOVA results strongly suggest significant variations in load levels among the different rain intensity categories for all hours. For example, let's take a look at the results for hour 12.

+ + + + + + + + + + + + + + + + + + + + + + + + + + +
sum_sqdfFPR(>F)
C(rainfall)37650.7620.11441.3346e-23
Residual8.01363e+0625687nannan
+
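For completeness, an ANOVA table of this shape can be produced with statsmodels along the following lines. This is a hedged sketch: hour12 is an assumed DataFrame holding the hour-12 observations, with a load column and a categorical rainfall column:

import statsmodels.api as sm
from statsmodels.formula.api import ols

# one-way ANOVA of load across rainfall-intensity categories for a single hour
model = ols("load ~ C(rainfall)", data=hour12).fit()
print(sm.stats.anova_lm(model, typ=2))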

We chose to include rainfall as a factor in the model. However, the + other meteorological conditions didn't show similarly clear results, so + we decided to leave them out of the model.

+

The Model

+

For the purpose of performance evaluation, a preliminary model was + developed employing a basic Long Short-Term Memory (LSTM) architecture + with a memory span of 5 steps. The model's objective is to forecast the + load at a specific traffic station. The model input encompasses + historical load data, rainfall measurements, as well as the hour and day + of the week.

+

Preliminary results derived from a single station exhibit promising + outcomes, as demonstrated in Figure 6. In this figure, we compare the + actual traffic load with the predictions made by the LSTM model. + Furthermore, an analysis of the training and validation loss curves + (Figure 5) indicates that the model is not experiencing issues related + to overfitting or underfitting.

+ + + + + +
+
+ +
Figure 5: Train vs Validation loss curves plot
+
+
+
+ +
Figure 6: Traffic Forecasting for a traffic station +
+
+
+
+
+

PyKX migration + +

+
+
+

Despite the promising forecasting results yielded by the LSTM model, certain concerns arose about the efficiency of the Python code employed for the project implementation. After profiling the entirety of the process, we found 4 key areas in which the code was behaving worse than expected:

+ + + + + + + + + + + + + + + + + + + + + + + + + +
pandas Time
Cleaning Weather247 ms ± 10.4 ms
Cleaning Traffic25.5 s ± 1.29 s
Join Final Table7.1 s ± 168 ms
Model Ingestion Preprocess3.2 s ± 54.2 ms
+

At this point, a fundamental decision had to be made regarding the migration strategy. Let's delve into the two + alternatives at our disposal, along with their positive and negative aspects:

+ + +

Let's get started!

+
+ 🔍 You can find on Github the Original + Python Project that will be migrated into PyKX. +
+
+
+

First of all we need to install and import PyKX:

+
+
+
+
!pip install pykx
+
+
+
+
+
import pykx as kx
+
+
+
+
+ 🔍 A license is required to use some of the following features. You can find more information in the + PyKX installation documentation. +
+
+
+

Datasets

+
+
+

An excellent starting point for the migration process is transferring our data to the q environment. We can even convert these objects back to pandas and reuse all our existing code. This approach ensures that our data remains stored within the kdb environment, thus benefitting from its rapid and scalable database capabilities. However, it's important to acknowledge that, by stopping there, we would sacrifice the processing power of kdb+/q. As a result, we will take the migration several steps further.

+

Before continuing, be sure to download the datasets required for running the next pieces of + code.

+
+
+

Traffic

+
+
+

The preprocessing of the traffic table was one of the most critical + parts in terms of time. Later on, we will showcase the improvement in + execution time compared to our pure pandas implementation.

+

The data loading will be executed employing the utilities facilitated + by PyKX:

+
+
+
+
traffic = kx.q.read.csv('../12-2022.csv', types="IPSIIIISI", delimiter=';')
+
+
+
+
+
+    Further Information + on PyKX Read/Write Utils +
+
+

+ PyKX provides an array of functions designed to facilitate data loading and writing tasks. These functions + encompass the capability to handle diverse file types, spanning both general formats and those specific to the + q language. Notably, the q-specific functionalities enable seamless transfer of tables between the q context + and the Python context, offering a bidirectional exchange of data.

+

+ Specifically, the parameters of the read.csv function adhere to conventional standards. The + file's URL and + delimiter are specified. It is important to emphasize the types parameter, which expects the q types associated with each column. +

+
+
+
+
+
+
print(traffic)
+
+
+
+ id fecha tipo_elem intensidad ocupacion carga vmed .. +-----------------------------------------------------------------------------.. +1001 2022.12.01D00:00:00.000000000 M30 3240 9 0 59 .. +1001 2022.12.01D00:15:00.000000000 M30 3240 9 0 59 .. +1001 2022.12.01D00:30:00.000000000 M30 300 1 0 66 .. +1001 2022.12.01D00:45:00.000000000 M30 300 1 0 66 .. +1001 2022.12.01D01:00:00.000000000 M30 300 1 0 66 .. +1001 2022.12.01D01:15:00.000000000 M30 300 1 0 66 .. +1001 2022.12.01D01:30:00.000000000 M30 300 1 0 66 .. +1001 2022.12.01D01:45:00.000000000 M30 300 1 0 66 .. +1001 2022.12.01D02:00:00.000000000 M30 300 1 0 66 .. +1001 2022.12.01D02:15:00.000000000 M30 300 1 0 66 .. +1001 2022.12.01D02:30:00.000000000 M30 300 1 0 66 .. +1001 2022.12.01D02:45:00.000000000 M30 300 1 0 66 .. +1001 2022.12.01D03:00:00.000000000 M30 300 1 0 66 .. +1001 2022.12.01D03:15:00.000000000 M30 300 1 0 66 .. +1001 2022.12.01D03:30:00.000000000 M30 300 1 0 66 .. +1001 2022.12.01D03:45:00.000000000 M30 300 1 0 66 .. +1001 2022.12.01D04:00:00.000000000 M30 300 1 0 66 .. +1001 2022.12.01D04:15:00.000000000 M30 300 1 0 66 .. +1001 2022.12.01D04:30:00.000000000 M30 300 1 0 66 .. +1001 2022.12.01D04:45:00.000000000 M30 300 1 0 66 .. +.. +
+
+

As a result of this process, our table is now a PyKX object:

+
+
+
+
type(traffic)
+
+
+
+ pykx.wrappers.Table +
+
+

Let's examine the initial few rows of this table:

+
+
+
+
print(traffic[:3])
+
+
+
+ id fecha tipo_elem intensidad ocupacion carga vmed .. +-----------------------------------------------------------------------------.. +1001 2022.12.01D00:00:00.000000000 M30 3240 9 0 59 .. +1001 2022.12.01D00:15:00.000000000 M30 3240 9 0 59 .. +1001 2022.12.01D00:30:00.000000000 M30 300 1 0 66 .. +
+
+
+
+    Further Information + on pythonic indexing in PyKX +
+
+

+ Accessing data within PyKX objects, be it lists or tables, follows a methodology analogous to that of NumPy or + pandas. This facilitates the indexing of PyKX objects without necessitating the explicit utilization of q + functions. Furthermore, the capacity to index by columns is an additional convenience offered by this + approach.

+ >>> print(weather["H01"][:3]) +
270 9 94.7

+
+
+
+
+

PyKX integrates the kx.q.qsql interface, which allows us to query q tables concisely.

+
+
+
+
+    Further Information + on qSQL and SQL queries +
+
+

+ PyKX allows you to use qSQL queries using API functions. This includes select, exec, + update and delete functions, which share some common characteristics, mainly with + the arguments they receive. The first three share roughly this function call structure: +

+ kx.q.qsql.{function}({tab}, columns=..., where=..., by=...) +
+

+ The columns argument expects either a list of strings or a dictionary whose keys are column names and whose values are the expressions to compute for those columns, for instance when you want to apply a function to a column. Let's see an example:

+ >>> print(kx.q.qsql.select(weather, {"magnitude": "count distinct magnitude"}, by=["weather_station"])[:3]) +
weather_station| magnitude + ---------------| --------- + 4 | 1 + 8 | 2 + 16 | 2 +
+
+
+

+ But if you are more familiar with the q environment, it is also possible to use q functions directly. This approach reduces the verbosity of our code compared to the Python equivalent.

+ >>> print(kx.q("{select count distinct magnitude by weather_station from x}", weather)[:3]) +
weather_station| magnitude + ---------------| --------- + 4 | 1 + 8 | 2 + 16 | 2 +
+
+
+

+ PyKX lets you use ANSI SQL queries too!

+ >>> print(kx.q.sql("SELECT weather_station, COUNT(DISTINCT(magnitude)) FROM $ 1 GROUP BY weather_station", weather)[:3]) +
weather_station magnitude + ------------------------- + 4 1 + 8 2 + 16 2 +
+
+
+
+
+
+

To preprocess the traffic table, our objective is to compute the average load per date and station, while eliminating measurement errors. The power of qSQL enables us to accomplish this in a single query:

+
+
+
+
traffic = kx.q.qsql.select(traffic,
+                         columns = {'traffic_load': 'avg carga'},
+                         by = {"date":'fecha', "traffic_station": 'id'}, 
+                         where = "error=`N")
+
+
+
+
+
print(kx.q("3#",traffic))
+
+
+
+ date traffic_station| traffic_load +---------------------------------------------| ------------ +2022.12.01D00:00:00.000000000 1001 | 0 +2022.12.01D00:00:00.000000000 1002 | 0 +2022.12.01D00:00:00.000000000 1003 | 0 +
+
+
+
+    pandas alternative: + Traffic Cleaning +
+
+

+ Although it may look like a simple query, it is performing a seriously heavy operation. The original pandas + implementation looked like this:

+ >>> traffic = traffic[traffic["error"] == "N"].rename(columns={"carga":"load", "id":"traffic_station"}) + >>> traffic['date'] = pd.to_datetime(traffic['fecha'], errors='coerce') + >>> traffic.drop(["tipo_elem", "error", "periodo_integracion", "fecha", "intensidad", "ocupacion", "vmed"], axis=1, inplace=True) + >>> traffic.groupby(["date", 'traffic_station']).mean().reset_index() +
+

+ We observe a noticeable improvement in code legibility, especially once you have familiarised yourself a little with the qSQL syntax. The code looks sleeker and simpler using PyKX, and we noticed a roughly 10x reduction in execution time:

+ + + + + + + + + + + +
pandas TimePyKX Time
Cleaning Traffic25.5 s ± 1.29 s1.65 s ± 248 ms

+

+ This time we achieved both simpler and faster code, which we can fairly consider a success.

+
+
+
+
+

Keyed tables are indexed differently; that's why, if we want to see the first rows of the table, we need to use the # (take) operator.

+
+
+
+
print(kx.q("3#",traffic))
+
+
+
+ date traffic_station| traffic_load +---------------------------------------------| ------------ +2022.12.01D00:00:00.000000000 1001 | 0 +2022.12.01D00:00:00.000000000 1002 | 0 +2022.12.01D00:00:00.000000000 1003 | 0 +
+
+
+
+    Further Information + on Using NumPy and pandas +
+
+

+ For individuals who are still acclimatizing to the kdb+/q ecosystem, a partial adoption of NumPy's functionality remains accessible, specifically universal functions. By using this type of function, the q avg function that was employed in the previous query can be rephrased as follows:

+ >>> import numpy as np + >>> + >>> def npmean(arr): + >>> return np.add.reduce(arr) / len(arr) + >>> + >>> print(npmean(traffic["load"])) +
5.4

+

+ While the ability to reuse NumPy functions inside q is really nice and can be of great help during a migration + like the one we are exemplifying, we found that we were not able to use this NumPy function on our + kx.q.qsql() query. After executing the previous code, our query would look something like this: +

+ + >>> kx.q["npmean"] = npmean + >>> traffic = kx.q.qsql.select(traffic, + columns = {'traffic_load': 'npmean carga'}, + by = {"date":'fecha', "traffic_station": 'id'}, + where = "error=`N") +
+

+ Notice the function called to perform the average of the traffic_load column is the one defined earlier. Even though we didn't get any errors, this resulted in our code running for over 20 minutes with no feedback until we eventually stopped it manually, so we can't recommend using NumPy functions inside a qSQL query as we did. We suspect it may have something to do with q's avg function (and q's functions in general) being optimised for this kind of usage, while NumPy's implementation is not prepared to deal with how kdb+/q implements its tables. It may also have something to do with the group by clause, which creates a keyed table in q, but we can't confirm it as of now.

+

+ On the other hand, pandas can seamlessly interface with PyKX objects through the pandas API. This can be + effortlessly achieved by importing NumPy and pandas and toggling a designated flag. We can try to replicate + the previous select:

+ >>> import os + >>> os.environ['PYKX_ENABLE_PANDAS_API'] = 'true' + >>> import numpy as np + >>> import pandas as pd + >>> + >>> print(traffic.iloc[traffic["error"] == "N"][["date", "traffic_station","load"]].head()) +
fecha id carga +---------------------------------------- +2022.12.01D00:00:00.000000000 1001 0 +2022.12.01D00:15:00.000000000 1001 0 +2022.12.01D00:30:00.000000000 1001 0 +2022.12.01D00:45:00.000000000 1001 0 +2022.12.01D01:00:00.000000000 1001 0 +
+
+

+ However, it's worth noting that the pandas API is currently under development, so not all pandas functions have been incorporated yet; unfortunately, groupby is one of them. We hope that in the future we will be able to migrate our pandas code to PyKX without any changes.

+
+
+
+
+

Weather

+
+
+

This table, like the traffic table, will also be imported into the Python environment. Consequently, it becomes accessible as a Python object, albeit not within the q memory space.

+
+
+
+
weather = kx.q.read.csv('../dic_meteo22.csv', types='I'*4 + '*'*4 + 'FS'*24, delimiter=';')
+
+
+
+

To display a table in markdown format, we can transfer it to + pandas:

+
+
+
+
weather[:3].pd()
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
PROVINCIAMUNICIPIOESTACIONMAGNITUDPUNTO_MUESTREOANOMESDIAH01V01...H20V20H21V21H22V22H23V23H24V24
0287910881b'28079108_81_98'b'2022'b'12'b'22'270.0N...218.0V228.0V227.83V213.67V233.83V
1287910882b'28079108_82_98'b'2022'b'12'b'22'9.0N...10.0V10.0V9.00V9.00V8.00V
2287910883b'28079108_83_98'b'2022'b'12'b'22'94.7N...86.7V91.4V93.80V96.30V98.70V
+
+
+
+    Further Information + on Python/Q transformations +
+
+

+ Objects from q can be converted to pandas with .pd(), to PyArrow with .pa(), to + NumPy with .np() and to Python with .py() methods. This flexibility empowers Python + developers, especially those new to PyKX, to seamlessly tap into the capabilities of kdb+ databases while + acquainting themselves with q. +
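As a quick illustrative sketch using the weather table loaded above (variable names are just examples):

>>> weather_pd = weather.pd()               # q table -> pandas DataFrame
>>> estacion_np = weather["ESTACION"].np()  # q column -> NumPy array
>>> estacion_py = weather["ESTACION"].py()  # q column -> Python list
>>> back_to_q = kx.toq(weather_pd)          # pandas DataFrame -> PyKX table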

+
+
+
+

As is evident, these objects are not currently within the q memory + space. Let's proceed to explore how we can access and use q + features on these objects. First, we will examine the straightforward + xcol function, which affords us the ability to rename + columns: +

+
+
+
+
weather = kx.q.xcol({'ANO': 'year', 'MES': 'month', 'DIA': 'day', 'ESTACION':'weather_station', 'MAGNITUD':'magnitude'}, weather)
+
+
+
+
+
+    Further Information + on using q functions in PyKX +
+
+

+ A plethora of q functions can be accessed via the pykx.q interface. The attributes mirrored by + these functions closely resemble the parameters anticipated by the corresponding q functions, with Python data + structures being employed in lieu of q structures. It is important to note that these functions are compiled + and thoroughly elucidated within the PyKX + documentation

+ >>> kx.q.distinct(['A', 'B', 'B', 'B' ,'C']) + `A`B`C
+

+ For those who want to dig deeper into kdb+/q and gain experience, you can use q functions and pass PyKX + objects as arguments:

+ >>> kx.q("distinct", ['A', 'B', 'B', 'B' ,'C']) + `A`B`C
+

+ It's important to emphasize that in the preceding function, a Python object is being passed to a q function. + When Python objects have a direct equivalent in q, like dictionaries, they can be directly employed as + attributes for PyKX functions. Beneath the surface, PyKX adeptly manages the conversion to q data structures. +

+

+ Moreover, the application of q iterations could be + another avenue of exploration:

+ >>> kx.q("lower").each(['A', 'B', 'C']) + `a`b`c
+

+ Finally, you can even create functions and use them with PyKX or Python objects.

+ >>> kx.q("{u !(sum x=) each u:distinct x}", ['A', 'B', 'B', 'B' ,'C']) + A| 1 + B| 3 + C| 1
+
+
+
+
+

With the following code we have removed several columns that are not + relevant to the analysis.

+
+
+
+
weather = kx.q.qsql.delete(weather, ['PUNTO_MUESTREO', 'PROVINCIA', 'MUNICIPIO'])
+
+
+
+

The next step involves merging the year, month, and day columns into + a single unified date column. To initiate this process, we begin by + accessing these three individual columns. This can be accomplished + through indexing:

+
+
+
+
print(weather["year", "month", "day"])
+
+
+
+ "2022" "2022" "2022" "2022" "2022" "2022" "2022" "2022" "2022" "2022" "2022" .. +"12" "12" "12" "12" "12" "12" "12" "12" "12" "12" "12" .. +"22" "22" "22" "22" "22" "22" "22" "01" "02" "03" "04" .. +
+
+

We observe that the outcome consists of three lists, each as long as the sample size. The objective is to form a single list of that length, wherein the three date elements are combined:

+
+
+
+
print(kx.q.flip(weather["year", "month", "day"])[:3])
+
+
+
+ "2022" "12" "22" +"2022" "12" "22" +"2022" "12" "22" +
+
+

We appear to be approaching the desired outcome. Currently, we possess a list of sample-size length, wherein each position contains a sub-list comprising three elements: the year, the month, and the day. To consolidate each sub-list into a single, unified element, the each iterator can be used:

+
+
+
+
print(kx.q.each(kx.q.raze, kx.q.flip(weather["year", "month", "day"]))[:3])
+
+
+
+ "20221222" +"20221222" +"20221222" +
+
+

The final step entails converting the resulting data from string format to a date format. However, it's worth noting that certain functions, particularly the overloaded glyphs, have yet to be exposed as Python methods: for example cast ($), take (#) and concat (,). So we are forced to abandon the pythonic way of calling q functions and perform this cast by writing kdb+/q code using the pykx.q() method:

+
+
+
+
date = kx.q('"D"$',(kx.q.each(kx.q.raze, kx.q.flip(weather["year", "month", "day"]))))
+print(date[:3])
+
+
+
+ 2022.12.22 2022.12.22 2022.12.22 +
+
+

Finally, we add this column to our table:

+
+
+
+
weather_ = kx.q.qsql.update(weather, columns = {'date': date})
+
+
+
+
+
+    pandas Alternative: + Time Join +
+
+

+ In pandas, we achieved this by executing this operation on our table:

+ >>> pd.to_datetime(weather[["year", "month", "day"]]) +
+

+ It seems to be less complicated thanks to having a function that does exactly what we need, whereas in q we + had to implement this function ourselves.

+
+
+
+
+

Shortly after, some team members started using q code instead of PyKX + functions in the pythonic way, as they found the resulting code to be + sleeker and more concise. As their familiarity with q grew, they put + forth a suggestion to create a function entirely written in q.

+
+
+
+
weather = kx.q.qsql.update(weather, columns = {'date':'"D"$ raze each flip(year;month;day)'})
+
+
+
+

Once again, both solutions are entirely compatible and can even be + combined. It's at the programmer's discretion to opt for either + approach.

+

Now, the three columns that have already been incorporated into the + date can be eliminated:

+
+
+
+
weather = kx.q.qsql.delete(weather, ['year', 'month', 'day'])
+
+
+
+

The current state of the weather table is as follows:

+
+
+
+
weather[:3].pd()
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
weather_stationmagnitudeH01V01H02V02H03V03H04V04...V20H21V21H22V22H23V23H24V24date
010881270.0N252.0N216.0N242.0N...V228.0V227.83V213.67V233.83V2022-12-22
1108829.0N8.0N9.0N8.0N...V10.0V9.00V9.00V8.00V2022-12-22
21088394.7N97.6N96.6N97.5N...V91.4V93.80V96.30V98.70V2022-12-22
+
+
+

Now, let's shift our focus towards deconstructing the H* and V* columns into multiple rows, while simultaneously introducing a time column to prevent information gaps. In q, the conventional approach would involve leveraging functional qSQL to extract the columns conforming to the aforementioned patterns. However, we will capitalize on the advantage that PyKX offers of incorporating q code via strings to sidestep this method:

+
+
+
+
def functionalSearch(cols, pattern, func):
+    xcols = cols[kx.q.where(kx.q.like(cols, pattern))]
+    xstring = func.format(kx.q.sv(b";", kx.q.string(xcols)).py().decode("utf-8"))
+    return xcols, xstring
+
+
+
+

The function above accepts a list of columns, a designated pattern + for searching, and a q function represented as a string. This function + takes the columns found following the specified pattern in qSQL format + (where columns are accessed by their names, not symbols) as an argument. + When applied to all columns commencing with "H", it yields these columns + as a vector of symbols, alongside a string representation of these + columns in qSQL format:

+
+
+
+
cols = kx.q.cols(weather)
+found_columns, qsql_function = functionalSearch(cols, b'H*', "{}")
+
+print("Columns found: ", found_columns)
+print("qSQL function: ", qsql_function)
+
+
+
+ Columns found: `H01`H02`H03`H04`H05`H06`H07`H08`H09`H10`H11`H12`H13`H14`H15`H16`H17`H18`H19`.. +qSQL function: H01;H02;H03;H04;H05;H06;H07;H08;H09;H10;H11;H12;H13;H14;H15;H16;H17;H18;H19;H20;H21;H22;H23;H24 +
+
+

This capability is remarkably powerful, as it enables the utilization + of qSQL alongside variables, obviating the need for functional forms + which can often prove intricate for individuals new to kdb+/q + development.

+

Now, let's apply the above methodology to the columns commencing with + H, which correspond to the measurement values, and to + the columns commencing with V, which indicate the + validity of the measurements. The function employed to transform the + measurement column into rows is flip: +

+
+
+
+
hcols, value = functionalSearch(cols, b'H*', "flip({})")
+vcols, valid = functionalSearch(cols, b'V*', "flip({})")
+
+
+
+

Our next step involves feeding our internally constructed functions, + represented as strings, into the qSQL update function. This + is done in conjunction with the repetition of the 24-hour sequence to + match the initial number of rows:

+
+
+
+
weather = kx.q.qsql.update(weather, columns = {'hour': 'count[i]#enlist 01:00*til 24', 'values': value, 'valid': valid})
+
+
+
+

To remove columns beginning with H or + V, we can employ the same approach as earlier to + circumvent the need for functional qSQL: +

+
+
+
+
weather = kx.q.qsql.delete(weather, columns = kx.q.raze(hcols,vcols).py())
+
+
+
+

Let's see the result:

+
+
+
+
weather[:3].pd()
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
weather_stationmagnitudedatehourvaluesvalid
0108812022-12-22[0 minutes, 60 minutes, 120 minutes, 180 minut...[270.0, 252.0, 216.0, 242.0, 239.0, 246.0, 233...[N, N, N, N, N, N, N, N, N, N, V, V, V, V, V, ...
1108822022-12-22[0 minutes, 60 minutes, 120 minutes, 180 minut...[9.0, 8.0, 9.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, ...[N, N, N, N, N, N, N, N, N, N, V, V, V, V, V, ...
2108832022-12-22[0 minutes, 60 minutes, 120 minutes, 180 minut...[94.7, 97.6, 96.6, 97.5, 97.5, 98.2, 98.8, 98....[N, N, N, N, N, N, N, N, N, N, V, V, V, V, V, ...
+
+
+

Finally, the remaining task involves expanding the table so that each + element within the lists corresponds to a distinct row:

+
+
+
+
weather = kx.q.ungroup(weather)
+
+
+
+

We can further streamline the table by eliminating rows that lack + validity and merging the date with the time:

+
+
+
+
weather = kx.q.qsql.select(weather, where = 'valid=`V')
+weather = kx.q.qsql.update(weather, columns = {'date': 'date+hour'})
+weather = kx.q.qsql.delete(weather, columns = ["valid", "hour"])
+weather[:3].pd()
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
weather_stationmagnitudedatevalues
0108812022-12-22 10:00:00263.00
1108812022-12-22 11:00:00247.83
2108812022-12-22 12:00:00215.83
+
+
+

The magnitude column plays a pivotal role in + attributing meaning to the value. As outlined by the + dataset creators, the diverse magnitudes align with the elements + encapsulated within the following dictionary:

+
+
+
+
magnitude = {80:"ultraviolet", 
+             81:"wind", 
+             82:"direction", 
+             83:"temperature", 
+             86:"humidity", 
+             87:"pressure", 
+             88:"solar", 
+             89:"rainfall"}
+
+
+
+

We just have to replace each key with its corresponding value from the dictionary.

+
+
+
+
weather = kx.q('{update magnitude: x magnitude from y}', magnitude, weather)
+
+
+
+

Finally, all we have to do is separate the different weather + conditions into different columns:

+
+
+
+
weather = kx.q('{exec (value x)#magnitude!values by date,weather_station from y}',magnitude,weather)
+
+
+
+

The exec query provides us with the capability to + transform a dictionary, stored within a column, into multiple columns. + In this transformation, the keys become the column names, and the values + constitute the data within those respective columns. This proves + particularly beneficial when dealing with a dictionary that represents + the weather conditions of each entry, linked with their respective + values. By implementing this approach and subsequently grouping the + data, we effectively disseminate the weather conditions across distinct + columns for each weather entry and weather station.

+
+
+
+
+    pandas Alternative: + Weather Cleaning +
+
+

+ This turned out to be a complex migration, since on pandas this "flipping" functionality is provided by + melt: +

+ >>> weather_hour = weather.melt(id_vars=["weather_station", "magnitud", "date"], + value_vars=[x for x in weather.columns if re.match("^H", x)], + var_name="hour") + >>> + >>> weather_valid = weather.melt(value_vars=[x for x in weather.columns if re.match("^V", x)], var_name="valid") +
+

+ As for the subsequent operations, they mapped more directly onto the original pandas implementation:

+ >>> weather= weather_hour[weather_valid["value"] == "V"].reset_index() + >>> + >>> weather["hour"] = weather["hour"].str[1:] + >>> weather["hour"] = pd.to_timedelta(weather['hour'].astype(int)-1, unit='h') + >>> weather["date"] = weather["date"] + weather["hour"] + >>> + >>> weather_values = weather.assign(MAGNITUD = weather["magnitud"].map({80:"ultraviolet", + 81:"wind", + 82:"direction", + 83:"temperature", + 86:"humidity", + 87:"pressure", + 88:"solar", + 89:"rainfall"})) \ + .MAGNITUD.str.get_dummies() \ + .multiply(weather["value"], axis="index") + >>> + >>> weather = pd.concat([weather, weather_values], axis=1) + >>> weather = weather.drop(["index", "hour", "magnitud", "value"], axis=1) + >>> weather = weather.groupby(["date", "weather_station"]).sum().reset_index() +
+

+ All in all, this table proved to be quite tricky with the migration, but nothing unmanageable. The rest of the + migration was far simpler. Overall, we noticed again a 10x decrease in execution time, which we consider + something remarkable.

+ + + + + + + + + + + +
pandas TimePyKX Time
Cleaning Weather247 ms ± 10.4 ms26.7 ms ± 683 µs

+
+
+
+
+

Location

+
+
+

Both traffic and weather station tables will be loaded into the q + memory space:

+
+
+
+
kx.q["weather_station"] = kx.q(".Q.id", kx.q.read.csv('../Estaciones_control_datos_meteorologicos.csv', types=" IFF", delimiter=";", as_table=True))
+kx.q["traffic_station"] = kx.q.read.csv('../pmed_ubicacion_12-2022.csv', types = "SII**FFFF", delimiter = ";", as_table=True)
+
+
+
+

We are now able to access these objects within q functions without + the necessity of passing them as PyKX or Python objects. To illustrate, + let's proceed to modify the column names in both tables in order to + establish a uniform naming convention:

+
+
+
+
kx.q("weather_station:(`CDIGO_CORTO`LONGITUD`LATITUD!`weather_station`longitude`latitude) xcol weather_station")
+_=kx.q("traffic_station:(`id`longitud`latitud!`traffic_station`longitude`latitude) xcol traffic_station")
+
+
+
+
+
+    Further Information + on Using Q + memory space +
+
+

+ If you feel more comfortable programming in q, you have the option to operate within the q memory space. PyKX objects can be seamlessly transferred into the q memory space, where you can manipulate them as if you were operating within a q ecosystem, employing the following code: kx.q["table"]. Once you've completed your operations, you can effortlessly bring them back to the Python memory space by returning them with q code: kx.q("table")
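A minimal sketch of that round trip, using the traffic table from earlier (the q-side name tab is purely illustrative):

>>> kx.q["tab"] = traffic                                 # move the PyKX table into the q memory space
>>> kx.q("tab: select from tab where traffic_load > 0")   # operate on it with plain q
>>> filtered = kx.q("tab")                                # bring the result back as a PyKX object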

+
+
+
+
+

Our objective is to merge these two tables. Currently, there appears to be no identifier that readily facilitates a conventional join. Nevertheless, it's worth noting that both the weather and traffic stations are located by their coordinates. We can exploit the spatial proximity between traffic and weather stations to facilitate the join. To compute the distance between two sets of coordinates, the Haversine distance discussed previously can be employed. It's important to acknowledge that while this distance function is readily available in Python, it is not natively accessible within q.

+
+
+
+
pip install haversine
+
+
+
+

One potential approach would involve re-implementing the Haversine + distance function in q. However, this might become impractical for more + complex libraries. Alternatively, although slower, we could transmit our + q objects to Python and work with them there. Nonetheless, it's strongly + recommended to avoid switching objects between q and Python. However, + the data we had to move between memory spaces wasn't very large (and we + were careful about this) and we didn't see any noticeable drop in + performance.

+

The features we've elucidated earlier, allowing us to transition + between Python and q objects, empower us to temporarily reuse Python + code. This is particularly pertinent given the scale of tables we are + currently dealing with. To incorporate our q objects into this function, + we can use certain PyKX tools to convert them into Python + objects:

+
+
+
+
from haversine import haversine_vector, Unit
+dist = kx.toq(
+            haversine_vector(kx.q('`longitude`latitude # weather_station').pd(), 
+                             kx.q('`longitude`latitude # traffic_station').pd(),
+                             Unit.KILOMETERS, comb=True))
+
+
+
+

We've reached a point where we have a matrix detailing the distance + in kilometers for every combination of traffic and weather stations. Our + upcoming task is to pinpoint pairs of station identifiers that exhibit + the minimum distance:

+
+
+
+
ids = kx.q.each(kx.q('{first where x=min x}'), dist)
+distance_table = kx.q('{traffic_station ^ weather_station[x]}' ,  ids)
+distance_table = kx.q.qsql.delete(distance_table, columns = ['tipo_elem','distrito','cod_cent','nombre','utm_x','utm_y','longitude', 'latitude'])
+
+
+
+

With this we have a mapping that relates every traffic station to its + nearest weather station:

+
+
+
+
distance_table[:3].pd()
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + +
traffic_stationweather_station
03840109
13841109
238428
+
+
+

Final Table

+
+
+

Integrating the three tables is a relatively straightforward process. + The distances table can be seamlessly merged with either of the other + two using a simple left join. However, when joining the traffic and + weather tables, an asof join (aj) is necessary due to their + disparate time intervals. To conclude, two columns, time and day of the + week, should be appended to furnish the model with the data's inherent + seasonality:

+
+
+
+
complete = kx.q.lj(traffic, kx.q.xkey('traffic_station', distance_table))
+complete = kx.q.aj(kx.toq(['date','weather_station']), complete, weather)
+complete = kx.q.qsql.update(kx.q("0^",complete),  {"hour":"`hh$date", "weekday":'("d"$date)mod 7'})
+
+
+
+

Let's look at this last table:

+
+
+
+
kx.q("5#",complete).pd()
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
traffic_loadweather_stationultravioletwinddirectiontemperaturehumiditypressuresolarrainfallhourweekday
datetraffic_station
2022-12-0110010.0240.00.54239.06.084.0946.01.00.005
10020.0240.00.54239.06.084.0946.01.00.005
10030.0240.00.54239.06.084.0946.01.00.005
10060.0240.00.54239.06.084.0946.01.00.005
10090.01100.00.352.07.979.0937.01.00.005
+
+
+
+
+    pandas Alternative: + Final table +
+
+

+ This is another bottleneck we encountered in our profiling. In pandas, the code looked fairly similar, with a simple join and an asof join:

+ >>> complete = traffic.merge(distance_table, on=["traffic_station"], how="inner") +>>> complete = pd.merge_asof(complete.sort_values(["date"]), weather, on='date', by=["weather_station"]) +
+

+ However we once again noticed a 10x reduction in execution time:

+ + + + + + + + + + + +
pandas TimePyKX Time
Join Final Table7.1 s ± 168 ms686 ms ± 24.1 ms

+

+ This improvement is most likely due to performance benefits when using kdb, as it's optimized for time series + data and the asof join, which is where most of this performance gain came from.

+
+
+
+
+

Model

+
+
+

For the model's input, we chose to cherry-pick only the essential + columns. Moreover, we apply normalization to the rainfall column using a + straightforward MinMax scaler. This function can be included within the + q memory space, ready for use whenever necessary:

+
+
+
+
kx.q("minMaxScale:{[l] {(x-y)%(z-y)}[;min l;max l]l}")
+                  
+final = kx.q.qsql.select(complete, columns = {"date": "date",
+                                              "traffic_station":"traffic_station",
+                                              "hour":"hour", 
+                                              "weekday": "weekday", 
+                                              "traffic_load": "traffic_load%100",
+                                              "rainfall":"minMaxScale rainfall"}
+                                    )
+
+
+
+

Throughout this transition from pandas, the primary challenge emerged + while migrating the time_window function, given its + reliance on loops. Our approach involved first comprehending the input + data, defining the desired output, and then formulating an idiomatic q + implementation rather than a direct 1:1 migration. This method proved + more time-efficient.

+

In this scenario, our input consisted of a table, and our desired + output was a list of matrices for each station. To facilitate this + process, we devised multiple functions that proved invaluable:

+
+
+ +
+
+
+
_=kx.q("""sw:{({y#z _x}[x;y;]')til count b:y _x}""") # sliding window: length-y chunks of x, one starting at each index
+
+
+
+ +
+
+
+
_=kx.q("""gt:{y _(flip x)[z]}""") # gets target (in position z)
+
+
+
+ +
+
+
+
_=kx.q("""toMatrix:{({[t;i]value t[i]}[x;]')til count x:flip x}""") # table to matrix
+
+
+
+ +
+
+
+
_=kx.q("""
+        prepareData:{[data; ntest; chunkLen; columns; locTarget]  
+            train:(toMatrix')?[data;();`traffic_station;columns!({(y;(-;(count;x);z);x)}[;_;ntest]')columns]; 
+            test:(toMatrix')?[data;();`traffic_station;columns!({(y;(-;(count;x);z);x)}[;#;ntest]')columns];                                                                               
+            (((sw[;chunkLen]')test;(gt[;chunkLen;locTarget]')test);((sw[;chunkLen]')train;(gt[;chunkLen;locTarget]')train))   
+        }
+    """)
+
+
+
+

Let's test this function in action with only one station:

+
+
+
+
import numpy as np
+
+station_id = 4010
+
+station = kx.q.qsql.select(final, where=["traffic_station="+str(station_id)])
+
+data = kx.q("prepareData", station, 500, 5, kx.SymbolVector(['rainfall', 'traffic_load', 'hour', 'weekday']), 1)
+
+X_train, y_train = np.array(data[0][0][station_id].py()), np.array(data[0][1][station_id].py())
+X_test, y_test =  np.array(data[1][0][station_id].py()), np.array(data[1][1][station_id].py())
+
+
+
+
+
+    pandas Alternative: + Model Ingestion +
+
+

+ This is the last bottleneck we ran into while doing our profiling. We used the Python Sklearn MinMax scaler + for this.

>>> from sklearn.preprocessing import MinMaxScaler + >>> final_table["load"] /= 100 + >>> final_table[["rainfall"]] = MinMaxScaler().fit_transform(final_table[["rainfall"]]) +
+

+ We also created a custom function to make those 5-step back chunks for a particular station:

+ + >>>
# Table Index: Load -> 2, Rainfall -> 7, Hour -> 11, Weekday -> 12
+ >>> + >>>
# Assign to each traffic station an array with the target column and the training information
+ >>> + >>> train = final_table.groupby('traffic_station').apply(lambda x: np.array(x[:500])[:,[2,7,11,12]].astype(float)) + >>> test = final_table.groupby('traffic_station').apply(lambda x: np.array(x[500:])[:,[2,7,11,12]].astype(float)) + >>> + >>>
# Create 5-step-back chunks
+ >>> def time_window(traffic_station, dataset, look_back=5): + >>> data_X, data_y = [], [] + >>> station_data = dataset[traffic_station] + >>> for i in range(len(station_data)-look_back-1): + >>> data_X.append(station_data[i:(i+look_back)]) + >>> data_y.append(station_data[i+look_back+1, 2]) + >>> return np.array(data_X), np.array(data_y) + >>> + >>> train_X, train_y = time_window(station_id, train) + >>> test_X, test_y = time_window(station_id, test) +

+

+ In this final test, we once again notice a huge time improvement:

+ + + + + + + + + + + +
pandas TimePyKX Time
Model Ingestion Preprocess3.2 s ± 54.2 ms178 ms ± 8.01 ms

+
+
+

That wraps up our migration to PyKX. Next, we proceed to confirm that + the model is working as intended with the data we acquired through + PyKX.

+
+
+

Firstly, we install Tensorflow:

+
+
+
+
pip install tensorflow
+
+
+
+

And now we create and train a simple LSTM model:

+
+
+
+
from keras.models import Sequential
+from keras.layers import Dense,LSTM 
+
+model = Sequential()
+model.add(LSTM(units = 50, return_sequences=False, input_shape=[None,4]))
+model.add(Dense(units = 1))
+model.compile(loss='mae', optimizer='adam')
+
+def fit(train_X, train_y, test_X, test_y):
+    return model.fit(train_X, train_y, 
+                    epochs=50, batch_size=8, 
+                    validation_data=(test_X, test_y), 
+                    verbose=0, shuffle=False)
+
+
+def predict(data):
+    return model.predict(data, verbose=0)
+
+
+
+

Now we can observe the loss curves for both training and validation + over epochs.

+
+
+
+
import matplotlib.pyplot as plt
+
+history = fit(X_train,y_train,X_test,y_test)
+
+plt.plot(history.history['loss'], label='train')
+plt.plot(history.history['val_loss'], label='validation')
+
+plt.title("Train and Validation Loss Curves")
+plt.xlabel("Epochs")
+plt.ylabel("Loss")
+plt.legend()
+plt.show()
+
+
+
+ +
+
+

And finally, we can see the performance of the model in the following + graph:

+
+
+
+
plt.plot(y_test, label='test real')
+plt.plot(range(200,495), model.predict(X_test[200:], verbose=0).flatten(), label='test predict')
+plt.title("Real test vs Predicted test")
+plt.xlabel("Time(15 min)")
+plt.ylabel("Load")
+plt.legend(loc="upper right")
+plt.show()
+
+
+
+ +
+
+

Performance gains + +

+
+
+

As discussed earlier, all 4 bottlenecks were solved by migrating to + the kdb+/q environment taking advantage of PyKX. Overall, the final + picture looks like this:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
pandas Time PyKX Time
Cleaning Weather247 ms ± 10.4 ms26.7 ms ± 683 µs
Cleaning Traffic25.5 s ± 1.29 s 1.65 s ± 248 ms
Join Final Table7.1 s ± 168 ms 686 ms ± 24.1 ms
Model Ingestion Preprocess3.2 s ± 54.2 ms178 ms ± 8.01 ms
+

So we got a roughly 10x decrease across the board. We went from a total of about 36 seconds spent on these bottlenecks down to around 2.5 seconds, which is really impressive in itself. If a migration like this one were on the horizon, we would strongly suggest profiling the code to see where it struggles the most and trying to migrate those parts alone to PyKX first. Then, as a second step, we recommend porting the rest of the code to PyKX in order to avoid the overhead that moving data between memory spaces inherently adds to the process.

+
+
+
+ 🔍 You can find on Github the resulting PyKX-enhanced + pythonic code. +
+
+
+

pykx.q migration + +

+
+
+

After gaining proficiency in the q language, the team progressively + shifted towards utilizing kx.q() extensively, drawn by its + brevity. Yet, this approach introduced a level of monotony due to the + requirement of employing strings. As a remedy, a decision was made to + transition into a q environment. However, as highlighted in the previous + chapter, certain Python code proved challenging to migrate to q. + Consequently, the team chose to remain within PyKX, this time operating + within a q environment.

+

pykx.q facilitates the execution of Python code within a q + environment, thereby presenting new avenues for addressing existing + codebases. This flexibility enables the incorporation and utilization of + Python libraries, both those installed on the system and those available + as .py files.

+

In our scenario, we harness both of these options. We emphasize the + potential provided by these opportunities to integrate Python code + within pykx.q, while also providing a link to the project's q + script for readers seeking further insights.

+
+ 🔍 You can find on Github the resulting PyKX-enhanced + kdb+/q code. +
+

Our initial step involves executing the haversine_vector + function for calculating coordinate distances. Given its unavailability + in q, we opted to employ pykx.q to seamlessly integrate this library + directly into our q code. This was achieved through the following + lines:

+
+
.pykx.pyexec"from haversine import haversine_vector, Unit";
+
+
+
+    Further Information + on Python and Q Contexts +
+
+

+ In pykx.q, the Python and q contexts coexist too. The use of .pykx.pyexec permits the execution of Python code within the q environment. When you're in the q context, you can work with Python objects using the resources provided by pykx.q. These include indexing, function declaration, and library importation. However, consistent with the recommendation of the previous chapter, it's generally better to stick with q objects for as long as you can.

+ +

+ Objects from the Python memory can be retrieved using pykx.get and transmitted using + pykx.set. The Python data type to be employed in the transformation can be specified using + .pykx.setdefault. +

+
+
+

This function expects two pandas DataFrames as input, so we need to + change the default conversion type from NumPy to pandas:

+
+
.pykx.setdefault"pd";
+
+

Having done this, we can "move" our input variables to the Python + memory space using .pykx.set

+
+
.pykx.set[`a;`longitude`latitude#a];
+.pykx.set[`b;`longitude`latitude#b];
+
+

And finally execute our function:

+
+
(.pykx.eval"haversine_vector(a, b, Unit.KILOMETERS, comb=True)")`
+
+

Notice the backtick at the end, this is for converting back to a q + type.

+

The other way we can run Python code is to load a .py (renamed to .p) + file using \l. This could be done as follows:

+
+
system"l kerasmodel.p";
+
+

Here we have included the functions fit and + predict previously defined in last section. We can load + them and use them like this: +

+
+
modelfit:.pykx.get`fit;
+modelfit[train[0][3403];train[1][3403];test[0][3403];test[1][3403]];
+modelpredict:.pykx.get`predict;
+res:modelpredict[train[0][3403]];
+
+

In terms of performance, we noticed a slight decrease in execution time compared to using PyKX from Python:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
pandas Time PyKX Time q Time
Cleaning Weather247 ms ± 10.4 ms26.7 ms ± 683 µs22.3 ms ± 224 µs
Cleaning Traffic25.5 s ± 1.29 s 1.65 s ± 248 ms 1.627 s ± 124 ms
Join Final Table7.1 s ± 168 ms 686 ms ± 24.1 ms543 ms ± 10.1 ms
Model Ingestion Preprocess3.2 s ± 54.2 ms178 ms ± 8.01 ms153 ms ± 3.12 ms
+

These small timing shifts might simply be due to variations in memory or processor activity; in any case, the difference was barely noticeable as far as we could see.

+
+
+

Final + thoughts

+

Overall, this project turned out smoother than expected thanks to PyKX. It allowed us to get a foothold in the kdb+/q ecosystem before making the full transition to q code, which is impressive in itself. We achieved our goal of reducing execution time, as we saw throughout this post, and learned a lot about the kdb+/q ecosystem and its technologies.

+

It wasn't all smooth sailing, though. For instance, we hit a fundamental obstacle when using the pandas API. In an ideal world, the transition from pandas to PyKX using this API would be as simple as importing PyKX, enabling a flag and getting the input tables as PyKX objects. However, since we relied on operations such as groupby and melt, this turned out to be unfeasible: these operations were not yet implemented and we would have had to code them manually, which would take a long time in itself. We should note, however, that this feature is still in beta, so we look forward to future improvements in this regard, since it would make migrations like this one much simpler once it becomes a drop-in replacement for pandas calls.

+

In summary, with the experience we gained, we dare to recommend following these steps as a PyKX migration guide:

+
    +
  1. Migrate the original data to a kdb+/q environment or PyKX objects.
  2. +
  3. Profile the original code to locate bottlenecks. This allows us to + put the focus on the heavier parts of our process.
  4. +
  5. Once located, migrate those parts alone to PyKX. You will take a performance hit when moving data between memory spaces, but it's important to know the potential gains that are possible, so measure the migrated bottlenecks on their own to see the difference.
  6. +
  7. If your tables are not that large, you may be able to get away with + it as-is, not needing to do a full migration. If that's the case then + great! You are done. However, if you notice that the performance hit + when moving data between memory spaces is actually hindering the + process, consider a full migration to PyKX.
  8. +
  9. If a full migration to PyKX is needed, then first take a look at the + pandas API. By the time you read this, it may have already improved + compatibility and could be a drop-in replacement for pandas. If it's not + the case you will need to familiarise yourself with PyKX and get your + hands dirty as we had to.
  10. +
+

We hope this post can serve as a guide of sorts for those who want to familiarise themselves with PyKX and the kdb+/q ecosystem, since it brings a lot to the table. In the end, we were able to achieve a full 1:1 migration to PyKX, and even to q with pykx.q, which is impressive given the differences between the languages and their philosophies, all while notably cutting execution time. Below you will find some additional references to continue learning PyKX. Have fun with PyKX!

+
+
+

Bibliography

+ +
+ + + \ No newline at end of file diff --git a/assets/2023/09/15/132_0.png b/assets/2023/09/15/132_0.png new file mode 100644 index 0000000..559334e Binary files /dev/null and b/assets/2023/09/15/132_0.png differ diff --git a/assets/2023/09/15/134_0.png b/assets/2023/09/15/134_0.png new file mode 100644 index 0000000..81fadaa Binary files /dev/null and b/assets/2023/09/15/134_0.png differ diff --git a/assets/2023/09/15/fc_p.png b/assets/2023/09/15/fc_p.png new file mode 100644 index 0000000..899376b Binary files /dev/null and b/assets/2023/09/15/fc_p.png differ diff --git a/assets/2023/09/15/heatmap3.png b/assets/2023/09/15/heatmap3.png new file mode 100644 index 0000000..219c829 Binary files /dev/null and b/assets/2023/09/15/heatmap3.png differ diff --git a/assets/2023/09/15/loadperhour.png b/assets/2023/09/15/loadperhour.png new file mode 100644 index 0000000..98218d4 Binary files /dev/null and b/assets/2023/09/15/loadperhour.png differ diff --git a/assets/2023/09/15/loadperweekday.png b/assets/2023/09/15/loadperweekday.png new file mode 100644 index 0000000..a4b6d78 Binary files /dev/null and b/assets/2023/09/15/loadperweekday.png differ diff --git a/assets/2023/09/15/loss_graph_p.png b/assets/2023/09/15/loss_graph_p.png new file mode 100644 index 0000000..acde664 Binary files /dev/null and b/assets/2023/09/15/loss_graph_p.png differ diff --git a/assets/2023/09/15/loss_python.png b/assets/2023/09/15/loss_python.png new file mode 100644 index 0000000..fbe0740 Binary files /dev/null and b/assets/2023/09/15/loss_python.png differ diff --git a/assets/2023/09/15/rainfall.png b/assets/2023/09/15/rainfall.png new file mode 100644 index 0000000..49affec Binary files /dev/null and b/assets/2023/09/15/rainfall.png differ diff --git a/assets/2023/09/15/test_python.png b/assets/2023/09/15/test_python.png new file mode 100644 index 0000000..50f38d2 Binary files /dev/null and b/assets/2023/09/15/test_python.png differ diff --git a/img/relojes-blandos.jpg b/img/relojes-blandos.jpg new file mode 100644 index 0000000..9a62666 Binary files /dev/null and b/img/relojes-blandos.jpg differ