categorical variables #54

Open
freebiesoft opened this issue Jul 11, 2023 · 4 comments

Comments

@freebiesoft

Hi George, I was wondering what your opinion is on using your TST model on multivariate time series datasets with categorical variables. The dataset I have in mind is encrypted network traffic. It would consist of fields such as "timestamp", "time since last packet", "packet size", "protocol" (categorical), and various binary columns for TCP flags.

Thanks for your hard work, and I look forward to hearing from you :)

@gzerveas
Owner

Hello @freebiesoft, someone has reported (in an archived or open issue) that they tested the TST on purely binary input data, and that it seemed to work well - so that would be an encouraging sign.
Now, when it comes to the categorical variable, I am not sure exactly what temporal dependency you are trying to encode: would the protocol change from time step to time step? Or is it rather that an entire sample would refer to the same protocol, and you simply need to tag that sample with its corresponding protocol? In the latter case, there are probably better ways of doing this than adding a special dimension with the category index. I think it would help if you described (or, better still, showed with a drawing, table, etc.) what you envisage as input data.

@freebiesoft
Author

Hi @gzerveas, thanks for the info. My data samples are TCP streams. Each of these TCP streams belongs to a particular class such as VOIP, Chat (applications), Stream, etc. The full list, from either the service perspective or the app perspective, can be seen here: https://user-images.githubusercontent.com/77194157/184114283-3df9b1e3-ccf5-48fa-a81b-21c9a276d606.png

Each TCP stream contains a series of packets in time order. Each packet will have fields such as:

  • timestamp: an integer giving the number of microseconds since the beginning of the TCP stream (so the first packet will always be 0)
  • direction (bool): whether the packet is part of a request to the destination host or a response from the destination host
  • Protocol: usually will be either TCP or TLS, where TCP represents control flow related packets, and TLS packets will mostly contain actual application data
  • length: of this packet in bytes
  • sequence number (in bytes)
  • and a few other binary columns related to TCP flags

So those fields are essentially the columns, each packet in the stream is a row, and together they form one data sample.
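
A minimal sketch of how one such stream could be laid out as a rows-by-columns sample, in plain Python. The field names (`is_request`, `seq`, `syn`, `ack`), the two-entry protocol vocabulary, and the choice of a one-hot encoding for the protocol are all assumptions made for illustration; none of them come from the TST code base. One-hot is used here only because binary inputs were reported above to work, and it avoids implying an ordering between categories.

```python
# Hypothetical protocol vocabulary; a real dataset may contain more values.
PROTOCOLS = ["TCP", "TLS"]
# Hypothetical subset of TCP flag columns.
FLAGS = ["syn", "ack"]


def encode_packet(packet):
    """Turn one packet (a dict) into a flat list of numeric features."""
    one_hot = [1.0 if packet["protocol"] == p else 0.0 for p in PROTOCOLS]
    flags = [1.0 if packet[f] else 0.0 for f in FLAGS]
    return [
        float(packet["timestamp"]),            # microseconds since stream start
        1.0 if packet["is_request"] else 0.0,  # direction as a binary column
        float(packet["length"]),               # packet length in bytes
        float(packet["seq"]),                  # sequence number in bytes
        *one_hot,                              # one binary column per protocol
        *flags,                                # one binary column per TCP flag
    ]


def encode_stream(packets):
    """One row per packet, in time order; the list of rows is one sample."""
    return [encode_packet(p) for p in packets]


stream = [
    {"timestamp": 0, "is_request": True, "protocol": "TCP",
     "length": 60, "seq": 0, "syn": True, "ack": False},
    {"timestamp": 450, "is_request": False, "protocol": "TLS",
     "length": 1500, "seq": 1, "syn": False, "ack": True},
]
sample = encode_stream(stream)
print(len(sample), len(sample[0]))  # rows (packets) and columns (features)
```

In practice the numeric columns would likely need normalization before being fed to a model, since timestamps and lengths sit on very different scales from the binary columns.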

@haotruongnhat

Hi @freebiesoft, I am facing inputs similar to yours, but not exactly the same. Would you like to discuss this? My email: [email protected]

@zzzten

zzzten commented Dec 8, 2023

Hi @gzerveas,
In my dataset, I have two kinds of categorical variables: static categorical variables, which do not change from time step to time step, and dynamic categorical variables, which change constantly.
I have written a demo that processes these by adding embedding layers to the model. However, I want to know if it makes sense. Could you also please explain what the better ways of doing this are, compared with adding a special dimension with the category index, for the case where the whole sample has the same protocol?
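
For readers following along, the embedding idea can be sketched roughly as below. This is not the demo mentioned above and not part of the TST code base; it is a plain-Python illustration with hypothetical names and sizes. The lookup tables are randomly initialized lists here; in a PyTorch model they would typically be `torch.nn.Embedding` layers trained together with the rest of the network.

```python
import random


def make_embedding(num_categories, dim, rng):
    """A lookup table: one vector of length `dim` per category index."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
            for _ in range(num_categories)]


def build_inputs(numeric, dynamic_ids, static_id, dynamic_table, static_table):
    """Concatenate, per time step, numeric features with both embeddings."""
    static_vec = static_table[static_id]  # same vector repeated at every step
    return [
        row + dynamic_table[cat] + static_vec
        for row, cat in zip(numeric, dynamic_ids)
    ]


rng = random.Random(0)
dynamic_table = make_embedding(num_categories=3, dim=4, rng=rng)
static_table = make_embedding(num_categories=5, dim=2, rng=rng)

numeric = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # 3 time steps, 2 numeric features
dynamic_ids = [0, 2, 2]                          # category index per time step
static_id = 4                                    # one category for the whole sample

features = build_inputs(numeric, dynamic_ids, static_id,
                        dynamic_table, static_table)
print(len(features), len(features[0]))  # 3 time steps, 2 + 4 + 2 = 8 features
```

Repeating the static embedding at every time step is only one option; another is to keep it out of the sequence and combine it with the model's pooled output before the final classification layer.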
