Align information in relation to new architecture #270
Conversation
"source_type": "local", | ||
"source_data": "var/catalog_product.csv" | ||
} | ||
"format": { |
I think we should have a format_type such as csv, xml, etc.; I do understand we will currently only support CSV.
Instead of calling it "csv_separator", we could use "delimiter", which is a commonly understood term, and also drop the "csv_" prefix.
https://www.php.net/manual/en/function.fgetcsv.php
We can also look at some of the optional parameters in the link above, for instance "escape", "enclosure", and "length"; some of them are very commonly used in file-upload scenarios.
About "csv_" - agree here, as we have type defined in URL, we can use just params as "delimiter", "enclosure" and I will add "escape", make sense.
Length I guess its not really required param in this case. Cause we are basically defining how we will read file, which parts of file, etc. I never had a real usage of "length" param in csv import workflows.
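For clarity, here is a minimal sketch of how the "format" block could look after this renaming; the sample values are placeholders and this is an assumption based on the thread rather than the final spec:

```
"format": {
    "delimiter": ",",
    "enclosure": "\"",
    "escape": "\\"
}
```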
@@ -23,16 +26,15 @@ Path is relative from Magento Root folder

```
{
Do we think this can be used for multi-select or multi-file upload?
Not really. Again, this was built based on practical experience. A single file transfer can already be quite resource-consuming; multiple files would be risky.
If a customer wants to upload multiple files, the Bulk API is available for that case.
OK, I was just thinking from a UI perspective; sometimes we enable a multi-select option for small files. But in this case it could be risky and may overcomplicate things.
Yep, I would not include multiple files in the MVP in any case, especially since we have the Bulk API on board. But this can be discussed in future steps if we get requests from users.
Align format fields
@@ -207,70 +160,88 @@ Will return list of Source that was uploaded before.
}
```

## Start File Import Endpoint
### Base CSV Parser
Usually in a streaming-based file import process, we take records from the data stream and process them as we fetch, parse, and push them. Is there a specific customer requirement to separate upload and import from each other? This design will inherently make the process slower, since you have to upload the file first, and only when that is fully complete do you start the actual import. From the customer's perspective, they want to import the data into Magento as quickly as possible.
On the other hand, since uploading involves network latency and disk reads (I/O latency), it is usually a time-consuming step, so it is generally advisable to use a streaming-based approach end to end: because a lot of time is spent waiting on I/O, the CPU is better off continuing to process what it already has in hand, so as you receive records you can parse and push them simultaneously.
Since the final import will be based on queue processing, this should not cause deadlock issues.
We had multiple long discussions on this topic.
The main idea of splitting the process is to separate actions that can be independent from each other.
We are talking about decomposition of processes, and this approach has multiple benefits:
- We have our own set of imported data, so we can always see which data was uploaded, when, and by whom.
- Each file can be processed independently, so you can restart the import for a file as many times as you want, define different mappings for the same file, or fix errors in your mapping or processing rules without having to re-upload the whole file again and again.
- It is possible, and there are cases where, the same file can be used for different import types: for example, stock and price can be two different imports but be located in the same file.
- Source is a separate module, which means you can simply remove it. For Import it does not matter where the file came from: you can use a third-party extension, or create your own module that places the file in the correct location and creates the required entity in the database, and it will still work. The same applies in the other direction: you can upload a file and then create any custom script that processes it, and in that case you still use all the standard Magento security features for file validation and upload.
I hope it makes sense :-)
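To illustrate the decomposition, a hypothetical two-step flow could look roughly like this; the endpoint paths and the "source_uuid" field are assumptions for illustration only, while the other fields come from the proposal. First the source is uploaded and registered:

```
POST /V1/import/source/csv
{
    "source_type": "local",
    "source_data": "var/catalog_product.csv",
    "format": {
        "delimiter": ",",
        "enclosure": "\""
    }
}
```

Then one or more imports can be started against the same registered source, each with its own mapping or rules, without re-uploading the file:

```
POST /V1/import/start
{
    "source_uuid": "uuid-returned-or-supplied-earlier",
    "import_type": "catalog_product",
    "validation_strategy": "string",
    "allowed_error_count": 0
}
```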
I believe we can still achieve all these points using a streaming-based approach. You can allow re-importing a file that has already been imported by updating only the import settings or format; nothing prevents that. I have been involved in numerous data migration / ETL projects in the past and have seen these issues many times.
Maybe we can improve this process after the MVP, but I just want to make it clear that this will not be performant for customers in many scenarios, and it will not use the potential of a streaming-based approach.
But I am absolutely okay if you want to move forward with this approach and test the waters; we can improve it later on.
Agree!! One more point I forgot to mention is the complexity of such a request. We already have the issue that each file format needs its own endpoint and its own interfaces because each has a custom format. If we also add the mapping information to the body, the body becomes even more complex and heavy.
Since we are providing an API for this, I believe that anyone integrating it with their own system can also build it as a single request.
In any case, I agree with you: let's try it and see how it works and what the user experience is like. So far we have had good experience with this on our project. Let's see.
Sure, I am okay with that. For large data sets or files, this sequential implementation has never performed at the desired rate for me in the past, but we can always improve on it; this is not something fundamentally right or wrong. It comes down to needs: an end-to-end streaming approach is much more efficient in terms of memory footprint and CPU processing, which is why all modern systems use it.
"csv_delimiter": "string", | ||
"multiple_value_separator": "string" | ||
} | ||
"uuid": "UUID", |
Will this UUID be returned by the system/API itself rather than supplied by the user? If so, we can remove it from the request, and we can also name it simply an ID rather than a UUID, since UUID is only one strategy for generating IDs and could change in the future. We should not bake the underlying ID-generation strategy into the name of the entity ID. During import, I believe we use the name source_id; we should be consistent in naming as well.
The UUID can be defined by the client, but if it is not set, we generate it on the Magento side; that is why this parameter is here.
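As a small sketch, reusing field names from this proposal, a client can supply its own identifier:

```
{
    "uuid": "123e4567-e89b-12d3-a456-426614174000",
    "source_type": "local",
    "source_data": "var/catalog_product.csv"
}
```

If "uuid" is omitted from the request, Magento generates one and returns it in the response.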
I understand your point about ID naming.
It is the correct idea if we are talking about a system that follows this pattern everywhere.
Unfortunately that is not the case in Magento, where _id fields throughout the system are database IDs.
At the same time, you are right that the user should not need to be aware of how the ID is generated.
In our implementation it is explicitly a UUID, because that is how it currently works at the code level.
May I ask that we leave it as it is for now and have another discussion about it?
By the way, you could raise this topic at the next architectural meeting, so there will be more people to discuss it, and it may define a pattern for a future Magento standard for ID naming.
OK, I will raise this at the architecture meeting.
}
"import_images_file_dir": "string"
},
"convertingRules": [
1. I am not sure I understand how to call this API; the intent of the parameters is not clear, to be honest.
2. What is the use case for sorting while importing? Usually sorting of any kind doesn't happen during data importing. Is this sorting of rules, i.e. the sequence of rule execution?
That's fine, it's not so easy at first glance :-)
convertingRules define how values can be changed during import to achieve our goals.
At the same time, it is not a required parameter: out of the box, without this parameter, we support the standard Magento CSV format.
But imagine the real world, where each system exports its own format. This means we have to provide the possibility to configure the import source so that the customer can import any CSV data structure they want.
This is where convertingRules come in.
In the code we have a sub-module, RulesConverter.
The module takes all the data we give it (in this case, read from the source file).
The converter then applies the rules to this data.
Example: if the import file has Status == Yes/No, we have a rule that converts this to 0/1, and so on for all required non-standard Magento fields.
It then returns the changed input data, ready for import into Magento.
Sort order is required because you may have multiple manipulations of the same field, so the order of execution has to be correct.
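As a hedged sketch of the Yes/No example, the rule name "replace_value" and the parameter keys below are illustrative assumptions; only the "convertingRules" array and the idea of an execution order come from the proposal:

```
"convertingRules": [
    {
        "name": "replace_value",
        "sort_order": 1,
        "parameters": { "field": "status", "from": "Yes", "to": "1" }
    },
    {
        "name": "replace_value",
        "sort_order": 2,
        "parameters": { "field": "status", "from": "No", "to": "0" }
    }
]
```

Because both rules manipulate the same "status" field, "sort_order" makes the execution sequence explicit.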
Thanks for the clarification.
"delimiter": "string", | ||
"enclosure": "string", | ||
"escape": "string", | ||
"multiple_value_separator": "string" |
What is the purpose of this "multiple_value_separator"?
For example, a multiselect attribute value:
"color": "Red|Green|White"
In this case | is the multiple value separator.
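In other words, a small sketch assuming the format block from this spec:

```
"format": {
    "delimiter": ",",
    "multiple_value_separator": "|"
}
```

With this configuration, a CSV cell containing Red|Green|White is split into the three values Red, Green, and White for the multiselect attribute.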
Okay, so should it be multiple_value_delimiter, to match the convention above?
Makes sense, aligned.
"import_type": "catalog_product", | ||
"validation_strategy": "string", | ||
"allowed_error_count": 0, | ||
"import_image_archive": "string", |
Are we planning to import images as well? If yes, I'm not sure where they will land; it could be a CDN, etc.
Yes, we are planning to, and currently they will be imported into Magento.
Out of the box, Magento saves all images under pub/media/.
At the same time, the Magento API accepts images as base64-encoded content.
So on our side we don't really care where the images are stored; we just take what the customer gives us and forward it to Magento in base64 form.
The customer provides images by uploading them to the Magento filesystem, and they will be deleted after import.
Ideally we should also support CDN storage, and Magento already has some support for it, but for our module I think that is currently out of scope for the MVP.
It is an interesting point for the future; we can groom where the user can place images for import.
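A hedged sketch of how the image-related parameters from this proposal might be filled in; the paths are illustrative assumptions, while the field names come from the spec above. Images are placed on the Magento filesystem, referenced from the import request, forwarded to Magento as base64-encoded content, and removed after the import:

```
{
    "import_type": "catalog_product",
    "import_image_archive": "var/import/images.zip",
    "import_images_file_dir": "var/import/images"
}
```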
Update conversion rules
This one can be closed. A new one will come.
Description
I aligned the information about the architectural vision of Asynchronous Import.
Reviewers
@naydav @paliarush