From 008a38dd6db6b5bb07b5be010bddc01f38531e57 Mon Sep 17 00:00:00 2001
From: Oscar Nydza All Roads Lead
post Accelerating
Python Workflows using PyKX, which we highly recommend reading, we
will observe a significant performance advantage of the resulting PyKX
- code compared to the initial Pandas implementation. More references are
+ code compared to the initial pandas implementation. More references are
available in the bibliography at the end of the post.
The structure of the post will be as follows:
The initial section regarding the use case is independent of programming languages and is primarily included for reference purposes. If you're eager to delve directly into the code and begin learning how
- to migrate pure Pandas-based Python code into PyKX, you can proceed to
+ to migrate pure pandas-based Python code into PyKX, you can proceed to
the second section now and revisit the first section as necessary.
With the aim of predicting traffic congestion in the presence of
@@ -823,7 +823,7 @@
This visualization is best shown using a heatmap, where the distances are displayed on a range from 0 to 20 kilometers:
@@ -1009,14 +1009,14 @@
- Partial migration: This approach involves identifying the specific points where Pandas
+ Partial migration: This approach involves identifying the specific points where pandas
experiences
the greatest strain. Subsequently, these segments can be migrated to q using PyKX, while leaving the rest of
the code
intact. This alternative capitalizes on the compatibility features of PyKX, which ensure a seamless
interaction between
- pure Pandas/Numpy and PyKX. For instance, we may use the .pd()
method, which allows us to convert
+ pure pandas/NumPy and PyKX. For instance, we may use the .pd()
method, which allows us to convert
a PyKX table
- object into a Pandas dataframe. This strategy can be particularly effective if the demarcation between
+ object into a pandas dataframe. This strategy can be particularly effective if the demarcation between
computationally
demanding and less complex segments is evident. However, this isn't always the case, leading to multiple
conversions
@@ -1240,7 +1240,7 @@
A promising alternative to this intermediate transformations method is the PyKX implementation of the
- Pandas API. However,
+ pandas API. However,
as we will explore later, even this had to be discarded for our particular case.
An excellent starting point for the migration process involves transferring our data to the q environment.
- We can even revert these objects to Pandas and reuse all our existing code. This approach ensures that our
+ We can even revert these objects to pandas and reuse all our existing code. This approach ensures that our
data remains stored within the kdb environment, thus benefitting from its rapid and scalable database capabilities. However, it's important to acknowledge that we might sacrifice the processing power of kdb+/q. As a result, we
@@ -1301,7 +1301,7 @@
The preprocessing of the traffic table was one of the most critical parts in terms of time. Later on, we will showcase the improvement in
- execution time compared to our pure Pandas implementation.
+ execution time compared to our pure pandas implementation.
The data loading will be executed employing the utilities facilitated by PyKX:
- Accessing data within PyKX objects, be it lists or tables, follows a methodology analogous to that of Numpy or
- Pandas. This facilitates the indexing of PyKX objects without necessitating the explicit utilization of q
+ Accessing data within PyKX objects, be it lists or tables, follows a methodology analogous to that of NumPy or
+ pandas. This facilitates the indexing of PyKX objects without necessitating the explicit utilization of q
functions. Furthermore, the capacity to index by columns is an additional convenience offered by this approach.
Traffic
- Pandas Alternative:
+ pandas alternative:
Traffic Cleaning
- Although it may look like a simple query, it is performing a seriously heavy operation. The original Pandas
+ Although it may look like a simple query, it is performing a seriously heavy operation. The original pandas
implementation looked like this:
>>> traffic = traffic[traffic["error"] == "N"].rename(columns={"carga":"load", "id":"traffic_station"})
@@ -1538,7 +1538,7 @@ Traffic
- Pandas Time
+ pandas Time
PyKX Time
@@ -1577,13 +1577,13 @@ Traffic
- For individuals who are still acclimatizing to the kdb+/q ecosystem, a partial adoption of Numpy's
+ For individuals who are still acclimatizing to the kdb+/q ecosystem, a partial adoption of NumPy's
functionality remains accessible, specifically universal functions. By using this type of
function, the average q function that was employed in the previous query can be rephrased as follows:
@@ -1597,8 +1597,8 @@ Traffic
5.4
- While the ability to reuse numpy functions inside q is really nice and can be of great help during a migration
- like the one we are exemplifying, we found that we were not able to use this numpy function on our
+ While the ability to reuse NumPy functions inside q is really nice and can be of great help during a migration
+ like the one we are exemplifying, we found that we were not able to use this NumPy function on our
kx.q.qsql()
query. After executing the previous code, our query would look something like this:
Traffic
Notice the function called to perform the average of the traffic_load
column is the one defined
earlier. Even though we didn't get any errors, this resulted in our code running for over 20 minutes with no
- feedback until we eventually stopped it manually, so we can't recommend the usage Numpy functions inside a
+ feedback until we eventually stopped it manually, so we can't recommend the use of NumPy functions inside a
qSQL query like we did. We suspect it may have something to do with q's avg
function (and all of
- q's functions) being optimised for this kind of usages and Numpy's implementation not being ready to deal with
+ q's functions) being optimised for this kind of usage and NumPy's implementation not being ready to deal with
how kdb+/q implements its tables. It may also have something to do with the group by
clause,
which creates a keyed table on q, but we can't confirm it as of now.
- On the other hand, Pandas can seamlessly interface with PyKX objects through the Pandas API. This can be
- effortlessly achieved by importing Numpy and Pandas and toggling a designated flag. We can try to replicate
+ On the other hand, pandas can seamlessly interface with PyKX objects through the pandas API. This can be
+ effortlessly achieved by importing NumPy and pandas and toggling a designated flag. We can try to replicate
the previous select:
>>> import os
@@ -1642,9 +1642,9 @@ Traffic
- However, it's worth noting that the Pandas API is currently under development, hence not all of Pandas
+ However, it's worth noting that the pandas API is currently under development, hence not all of pandas'
functions have been fully incorporated yet. And unfortunately, groupby
is one of them. We hope
- that in the future we can migrate our Pandas code to PyKX without any changes.
+ that in the future we can migrate our pandas code to PyKX without any changes.
@@ -1664,7 +1664,7 @@ Weather
To display a table in markdown format, we can transfer it to
- Pandas:
+ pandas:
@@ -1786,8 +1786,8 @@ Weather
- Objects from q can be converted to Pandas with .pd()
, to PyArrow with .pa()
, to
- Numpy with .np()
and to Python with .py()
methods. This flexibility empowers Python
+ Objects from q can be converted to pandas with .pd()
, to PyArrow with .pa()
, to
+ NumPy with .np()
and to Python with .py()
methods. This flexibility empowers Python
developers, especially those new to PyKX, to seamlessly tap into the capabilities of kdb+ databases while
acquainting themselves with q.
@@ -1952,13 +1952,13 @@ Weather
- Pandas Alternative:
+ pandas Alternative:
Time Join
- In Pandas, we achieved this by executing this operation on our table:
+ In pandas, we achieved this by executing this operation on our table:
>>> pd.to_datetime(weather[["year", "month", "day"]])
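The pandas call above assembles a single date column from separate year, month and day columns. As a rough, library-free illustration of that idea (not the post's implementation), the sketch below does the same for a list of rows using only the standard library. The `weather_rows` data and the `assemble_dates` helper are hypothetical names introduced here for illustration.

```python
from datetime import datetime

# Hypothetical stand-in for the weather table: one dict per row.
weather_rows = [
    {"year": 2023, "month": 8, "day": 5},
    {"year": 2023, "month": 9, "day": 15},
]


def assemble_dates(rows):
    """Combine the year/month/day fields of each row into one datetime."""
    return [datetime(r["year"], r["month"], r["day"]) for r in rows]


dates = assemble_dates(weather_rows)
```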
@@ -2356,13 +2356,13 @@ Weather
- Pandas Alternative:
+ pandas Alternative:
Weather Cleaning
- This turned out to be a complex migration, since on Pandas this "flipping" functionality is provided by
+ This turned out to be a complex migration, since on pandas this "flipping" functionality is provided by
melt
:
Weather
- As for the subsequent operations, those turned more alike to the original Pandas implementation:
+ As for the subsequent operations, those turned out more similar to the original pandas implementation:
>>> weather= weather_hour[weather_valid["value"] == "V"].reset_index()
>>>
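To make the "flipping" that `melt` performs more concrete, here is a minimal pure-Python sketch of a wide-to-long reshape. It is only an illustration of the idea, not the pandas implementation or the q code from the post. The column names (`station`, `H01`, `H02`) and the local `melt` helper are assumptions made for this example, and unlike pandas the sketch emits rows in row-major order.

```python
def melt(rows, id_vars, var_name="variable", value_name="value"):
    """Turn every non-id column of each row into its own (variable, value) row."""
    long_rows = []
    for row in rows:
        # Identifier columns are repeated on every output row.
        ids = {key: row[key] for key in id_vars}
        for column, value in row.items():
            if column not in id_vars:
                long_rows.append({**ids, var_name: column, value_name: value})
    return long_rows


# Hypothetical wide table: one column per hour of measurements.
wide = [
    {"station": 1, "H01": 0.0, "H02": 1.5},
    {"station": 2, "H01": 0.3, "H02": 0.0},
]
long_table = melt(wide, id_vars=["station"], var_name="hour")
```

Each input row with two hourly columns becomes two output rows, which is why this step grows the table and can become expensive on large inputs.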
@@ -2405,7 +2405,7 @@ Weather
- Pandas Time
+ pandas Time
PyKX Time
@@ -2710,13 +2710,13 @@ Final Table
- Pandas Alternative:
+ pandas Alternative:
Final table
- This is another bottleneck we encountered on our profiling. On Pandas, the code looked kind of similar, with a
+ This is another bottleneck we encountered on our profiling. On pandas, the code looked kind of similar, with a
simple join and an asof join:
>>> complete = traffic.merge(distance_table, on=["traffic_station"], how="inner")
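For readers unfamiliar with asof joins, the sketch below shows the matching rule in plain Python: each left row is paired with the most recent right row whose key is less than or equal to its own (a "backward" match). This is an illustrative sketch under simplified assumptions, not the pandas or q implementation. The `asof_join` helper, the integer `time` keys and the sample rows are all hypothetical.

```python
from bisect import bisect_right


def asof_join(left, right, on):
    """Attach to each left row the last right row whose key is <= the left key."""
    right_sorted = sorted(right, key=lambda r: r[on])
    keys = [r[on] for r in right_sorted]
    joined = []
    for row in left:
        # Index of the first right key strictly greater than the left key.
        pos = bisect_right(keys, row[on])
        match = right_sorted[pos - 1] if pos > 0 else {}
        # Left values win on shared columns, so the left key is preserved.
        joined.append({**match, **row})
    return joined


# Hypothetical sample data with integer timestamps.
traffic_rows = [{"time": 5, "load": 10}, {"time": 12, "load": 30}]
weather_rows = [{"time": 0, "rain": 0.0}, {"time": 10, "rain": 2.5}]
complete_rows = asof_join(traffic_rows, weather_rows, on="time")
```

Left rows that precede every right key are kept without any matched columns, which mirrors the behaviour of a left-style asof join.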
@@ -2728,7 +2728,7 @@ Final Table
- Pandas Time
+ pandas Time
PyKX Time
@@ -2768,7 +2768,7 @@ Model
- Throughout this transition from Pandas, the primary challenge emerged
+
Throughout this transition from pandas, the primary challenge emerged
while migrating the time_window
function, given its
reliance on loops. Our approach involved first comprehending the input
data, defining the desired output, and then formulating an idiomatic q
@@ -2880,7 +2880,7 @@
Model
- Pandas Alternative:
+ pandas Alternative:
Model Ingestion
@@ -2922,7 +2922,7 @@ Model
- Pandas Time
+ pandas Time
PyKX Time
@@ -2992,7 +2992,7 @@ Model
-
+
-
+
Performance gains
@@ -3025,7 +3025,7 @@ Performance gains
- Pandas Time
+ pandas Time
PyKX Time
@@ -3110,7 +3110,7 @@ pykx.q migration
Further Information
- on Python and Q Context
+ on Python and Q Context
pykx.q migration
- This function expects two Pandas DataFrames as input, so we need to
- change the default conversion type from Numpy to Pandas:
+ This function expects two pandas DataFrames as input, so we need to
+ change the default conversion type from NumPy to pandas:
"pd"; .pykx.setdefault
@@ -3178,7 +3178,7 @@ pykx.q migration
- Pandas Time
+ pandas Time
PyKX Time
q Time
@@ -3224,8 +3224,8 @@ Final
throughout this post, and learned a lot about the kdb+/q ecosystem and
its technologies.
It wasn't all smooth sailing, though. For instance, we hit a
- fundamental obstacle when using the Pandas API. In an ideal world, the
- transition from Pandas to PyKX using this API would be as simple as
+ fundamental obstacle when using the pandas API. In an ideal world, the
+ transition from pandas to PyKX using this API would be as simple as
importing PyKX, enabling a flag and getting the input tables as PyKX
objects. However, since we relied on operations such as
group_by
and melt
, it ended up being
@@ -3234,7 +3234,7 @@
Final
should note, however, that this feature is still in beta, so we look
forward to future improvements in this regard since it would make
migrations like this one much simpler once it becomes a drop-in
- replacement for Pandas calls.
+ replacement for pandas calls.
In summary, with the experience we gained, we dare to recommend
following these steps as a PyKX migration guide:
@@ -3254,8 +3254,8 @@ Final
when moving data between memory spaces is actually hindering the
process, consider a full migration to PyKX.
- If a full migration to PyKX is needed, then first take a look at the
- Pandas API. By the time you read this, it may have already improved
- compatibility and could be a drop-in replacement for Pandas. If it's not
+ pandas API. By the time you read this, it may have already improved
+ compatibility and could be a drop-in replacement for pandas. If it's not
the case you will need to familiarise yourself with PyKX and get your
hands dirty as we had to.
diff --git a/assets/2023/08/05/132_0.png b/assets/2023/09/15/132_0.png
similarity index 100%
rename from assets/2023/08/05/132_0.png
rename to assets/2023/09/15/132_0.png
diff --git a/assets/2023/08/05/134_0.png b/assets/2023/09/15/134_0.png
similarity index 100%
rename from assets/2023/08/05/134_0.png
rename to assets/2023/09/15/134_0.png
diff --git a/assets/2023/08/05/fc_p.png b/assets/2023/09/15/fc_p.png
similarity index 100%
rename from assets/2023/08/05/fc_p.png
rename to assets/2023/09/15/fc_p.png
diff --git a/assets/2023/08/05/heatmap3.png b/assets/2023/09/15/heatmap3.png
similarity index 100%
rename from assets/2023/08/05/heatmap3.png
rename to assets/2023/09/15/heatmap3.png
diff --git a/assets/2023/08/05/loadperhour.png b/assets/2023/09/15/loadperhour.png
similarity index 100%
rename from assets/2023/08/05/loadperhour.png
rename to assets/2023/09/15/loadperhour.png
diff --git a/assets/2023/08/05/loadperweekday.png b/assets/2023/09/15/loadperweekday.png
similarity index 100%
rename from assets/2023/08/05/loadperweekday.png
rename to assets/2023/09/15/loadperweekday.png
diff --git a/assets/2023/08/05/loss_graph_p.png b/assets/2023/09/15/loss_graph_p.png
similarity index 100%
rename from assets/2023/08/05/loss_graph_p.png
rename to assets/2023/09/15/loss_graph_p.png
diff --git a/assets/2023/08/05/loss_python.png b/assets/2023/09/15/loss_python.png
similarity index 100%
rename from assets/2023/08/05/loss_python.png
rename to assets/2023/09/15/loss_python.png
diff --git a/assets/2023/08/05/rainfall.png b/assets/2023/09/15/rainfall.png
similarity index 100%
rename from assets/2023/08/05/rainfall.png
rename to assets/2023/09/15/rainfall.png
diff --git a/assets/2023/08/05/test_python.png b/assets/2023/09/15/test_python.png
similarity index 100%
rename from assets/2023/08/05/test_python.png
rename to assets/2023/09/15/test_python.png