Mismatch in the placement of ignore tags and lack of translation #93

Open
Pavelrst opened this issue Jan 24, 2024 · 2 comments

Comments

@Pavelrst

DeepL Python client version: 1.16.1, Python 3.9.18

For a long input, some of the ignored tags are moved to the beginning of the output string, and the text itself is not translated at all.

Input Example:

<x>sentece 1</x> This is Roger Penrose, certainly one of the great scientists of our time, <x>sentece 2</x> winner of the 2020 Nobel Prize in physics for his work Reconciling Black Holes with Einstein's general theory of relativity. <x>sentece 3</x> But back in the 1970s, Roger Penrose made a contribution to the world of mathematics and that part of mathematics known as tiling. <x>sentece 4</x> You know tiling? <x>sentece 5</x> The process of putting tiles together so that they form a particular pattern. <x>sentece 6</x> The thing that was remarkable about the pattern that Roger Penrose developed is that by using only two shapes, he constructed a pattern that could be expanded infinitely in any direction without ever repeating, <x>sentece 7</x> much like the number Pi has a decimal that isn't random, but it will go on forever without repeating. <x>sentece 8</x> In mathematics, this is a property known as aperiodicity. <x>sentece 9</x> And the notion of an aperiodic tile set using only two tiles was such a sensation, it was given the name Penrose tiling. <x>sentece 10</x>Here's Roger Penrose, now Sir Roger Penrose, <x>sentece 11</x>standing on a field of Penrose tiles. <x>sentece 12</x> Then in 2007, this man Peter Lu, who was then a graduate student in physics at Princeton, while on vacation with his cousin in Uzbekistan, <x>sentece 13</x> discovered this pattern on a 14th century Madrasa. <x>sentece 14</x> And after some analysis, concluded that this was in fact Penrose tiling, 500 years before Penrose. <x>sentece 15</x> That information took the scientific world by storm and prompted headlines everywhere, <x>sentece 16</x> including Discover Magazine, which proclaimed this the 59th most important scientific discovery of the Year 2007. <x>sentece 17</x> So- no- now we've heard about this amazing pattern from the point of view of mathematics <x>sentece 18</x> and from physics, and now <x>sentece 19</x> archen-archaeology. 
<x>sentece 20</x> So that leads us to the question: What was there about this pattern that this ancient culture found so important that they put it on their most important bu- building? <x>sentece 21</x> So for that, we look to the world of anthropology and ask the question: what was the worldview of the culture that made this? <x>sentece 22</x> And this is what we learn: <x>sentece 23</x> this pattern <x>sentece 24</x> is life, <x>sentece 25</x> and- and as you can see, <x>sentece 26</x> life's complicated.

Output:

<x>sentece 1</x> <x>sentece 10</x> <x>sentece 11</x> <x>sentece 12</x> <x>sentece 13</x> <x>sentece 14</x> <x>sentece 15</x> <x>sentece 16</x> <x>sentece 17</x> <x>sentece 18</x> <x>sentece 19</x> <x>sentece 20</x> <x>sentece 21</x> <x>sentece 22</x> <x>sentece 23</x> <x>sentece 24</x> <x>sentece 25</x> <x>sentece 26</x> this is Roger Penrose, certainly one of the great scientists of our time, <x>sentece 2</x> winner of the 2020 Nobel Prize in physics for his work Reconciling Black Holes with Einstein's general theory of relativity. <x>sentece 3</x> But back in the 1970s, Roger Penrose made a contribution to the world of mathematics and that part of mathematics known as tiling. <x>sentece 4</x> You know tiling? <x>sentece 5</x> The process of putting tiles together so that they form a particular pattern. <x>sentece 6</x> The thing that was remarkable about the pattern that Roger Penrose developed is that by using only two shapes, he constructed a pattern that could be expanded infinitely in any direction without ever repeating, <x>sentece 7</x> much like the number Pi has a decimal that isn't random, but it will go on forever without repeating. <x>sentece 8</x> In mathematics, this is a property known as aperiodicity. <x>sentece 9</x> And the noti 

code:

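import deepl

# AUTH_KEY (API key) and TEXT (the tagged input above) are defined elsewhere.
translator = deepl.Translator(AUTH_KEY)
resp = translator.translate_text(
    TEXT,
    target_lang="RU",
    formality="more",  # or formality="less"
    tag_handling="xml",
    ignore_tags="x",  # leave <x>...</x> contents untranslated
    split_sentences=deepl.SplitSentences.OFF,  # do not split the input into sentences
    preserve_formatting=True,
)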

If I send shorter texts with the same tag scheme, it works.
I'm setting SplitSentences.OFF because I don't want the sentences to be split and the context lost.
Can you suggest a quick fix for this specific case?
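
A failure like the one above can be detected automatically before the result is used. The following is a minimal sketch of such a check, not part of the deepl library: the helper names are hypothetical, and the regex assumes the flat, non-nested <x> tags from the example. It compares the order of the ignored tags and whether there is visible text between them in the input and in the output. It is a heuristic only, since it says nothing about whether the text between the tags was actually translated.

```python
import re
from typing import List, Tuple

# Matches one ignored tag; the capture group holds the tag's contents.
TAG_RE = re.compile(r"<x>(.*?)</x>", re.DOTALL)


def tag_layout(text: str) -> Tuple[List[str], List[bool]]:
    """Return the tag contents in order, plus a flag per gap around the
    tags telling whether that gap contains visible text."""
    parts = TAG_RE.split(text)  # even indices: gaps, odd indices: tag contents
    contents = parts[1::2]
    gaps_have_text = [bool(piece.strip()) for piece in parts[0::2]]
    return contents, gaps_have_text


def tags_look_misplaced(source: str, translated: str) -> bool:
    """Heuristic: True if tags were reordered, dropped, or bunched together."""
    return tag_layout(source) != tag_layout(translated)
```

A caller could retry with a shorter input, or fall back to another strategy, whenever tags_look_misplaced(TEXT, resp.text) returns True.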

@JanEbbing
Member

Hi, thanks for the report and sorry for the delay here.

This is an issue with our translation model (tag handling requires model information to reinsert the tags at the right place in the translation), and I've reported it to the relevant team.
Unfortunately, we can't commit to fixing individual translation mistakes; instead, these errors will improve over time as we release new model versions on a regular cadence.

To alleviate the issue, you could use the new context parameter to split the text into smaller chunks while preserving translation quality through context (e.g. translate the first 5 sentences in one request, then sentences 6-10 with 1-5 as context, etc.).
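
The chunking suggestion above could look roughly like the sketch below. It assumes the input has already been split into a list of sentences and that the installed client version accepts a context argument on translate_text. The helper names (chunk_with_context, translate_in_chunks, make_deepl_translate) are hypothetical and not part of the library. Whether ignored tags inside the context text affect the result is not covered here.

```python
from typing import Callable, Iterator, List, Tuple


def chunk_with_context(
    sentences: List[str], chunk_size: int = 5, context_size: int = 5
) -> Iterator[Tuple[str, str]]:
    """Yield (text, context) pairs: each chunk plus the sentences before it."""
    for start in range(0, len(sentences), chunk_size):
        text = " ".join(sentences[start:start + chunk_size])
        context = " ".join(sentences[max(0, start - context_size):start])
        yield text, context


def translate_in_chunks(
    sentences: List[str],
    translate: Callable[[str, str], str],
    chunk_size: int = 5,
) -> str:
    """Translate chunk by chunk and join the translated chunks with spaces."""
    pairs = chunk_with_context(sentences, chunk_size)
    return " ".join(translate(text, context) for text, context in pairs)


def make_deepl_translate(auth_key: str, target_lang: str = "RU"):
    """Build a translate(text, context) callable backed by the DeepL client."""
    import deepl  # imported lazily so the helpers above work without the package

    translator = deepl.Translator(auth_key)

    def translate(text: str, context: str) -> str:
        result = translator.translate_text(
            text,
            target_lang=target_lang,
            context=context or None,  # the first chunk has no preceding context
            tag_handling="xml",
            ignore_tags="x",
            split_sentences=deepl.SplitSentences.OFF,
            preserve_formatting=True,
        )
        return result.text

    return translate
```

Usage would be along the lines of translate_in_chunks(sentences, make_deepl_translate(AUTH_KEY)). Passing the translate callable in separately keeps the chunking logic testable without network access.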

@Agence-Superdev

Agence-Superdev commented Nov 14, 2024

Hello, I'm having the same issue with every target language other than English. When I input this sentence:

  • "Fiche technique : S00120126FR-3<ign><tbsp></ign>Mise à jour le : 04/10/2024<ign><tbsp></ign>Créée le : 12/10/2023"

I receive this in Bulgarian:

  • "<ign><tbsp></ign><ign><tbsp></ign>Лист с данни: S00120126EN-3 Актуализиран на: 04/10/2024 Създаден на: 12/10/2023"

While it works as intended in English:

  • "Data sheet: S00120126EN-3<ign><tbsp></ign>Updated on: 04/10/2024<ign><tbsp></ign>Created on: 12/10/2023"

This is the configuration I use:

{
    preserve_formatting: true,
    tag_handling: 'xml',
    ignore_tags: ['ign'],
    split_sentences: "nonewlines",
    outline_detection: false
}

I also tried all sorts of configurations, but none of them work. This happens in every language I tested besides English. Has anyone found a workaround?
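
One possible client-side workaround, offered only as a sketch and not as anything DeepL recommends in this thread: split the input at the ignored tags, translate each text segment separately, and put the tags back in their original positions. The helper name translate_around_ignored is hypothetical, and translate stands for any callable that sends one string to the API and returns the translation. The obvious cost is that each segment is translated without its neighbours, which is acceptable for independent fields like the ones in the example but may hurt quality for running prose.

```python
import re
from typing import Callable

# Capturing group so that re.split keeps the ignored tags in its result.
IGNORED = re.compile(r"(<ign>.*?</ign>)", re.DOTALL)


def translate_around_ignored(text: str, translate: Callable[[str], str]) -> str:
    """Translate only the text between <ign>...</ign> blocks, keeping each
    block verbatim and at its original position."""
    parts = IGNORED.split(text)  # even indices: text, odd indices: ignored tags
    out = []
    for index, part in enumerate(parts):
        is_tag = index % 2 == 1
        if is_tag or not part.strip():
            out.append(part)  # keep tags and empty/whitespace gaps untouched
        else:
            out.append(translate(part))
    return "".join(out)
```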
