From e4edff5c14dcff6efbb4833db871e8c91b7b4eb6 Mon Sep 17 00:00:00 2001
From: JJ Asghar <awesome@ibm.com>
Date: Mon, 9 Sep 2024 15:53:54 -0500
Subject: [PATCH] Added best practice guide for QNA

Best practices guide from the OCI and Lisa

Signed-off-by: JJ Asghar <awesome@ibm.com>
---
 docs/taxonomy/qna_yaml_best_practices.md | 63 ++++++++++++++++++++++++
 1 file changed, 63 insertions(+)
 create mode 100644 docs/taxonomy/qna_yaml_best_practices.md

diff --git a/docs/taxonomy/qna_yaml_best_practices.md b/docs/taxonomy/qna_yaml_best_practices.md
new file mode 100644
index 0000000..cc4d585
--- /dev/null
+++ b/docs/taxonomy/qna_yaml_best_practices.md
@@ -0,0 +1,63 @@
+## General `qna.yaml` best practices
+
+- Things to Avoid
+  - Historically, LLMs have been weak at math.
+  - Do not include complex math calculations in Q&A seeds.
+
+- Context
+  - What if the knowledge is based on documents that the base model does not already know?
+  - In the `qna.yaml` file, you can pass context as a chunk of information (text from the document that the Q&A pairs are based on). Adding context to the skill QnA file might produce better-quality generated data.
+
+- Formatting (front-end specific and may change)
+  - How should data in the Q&A file be formatted, especially tables?
+  - Currently, only files in Markdown format are supported.
+    - If the files are in any other format, they must be converted to Markdown.
+    - For automatic converters, we recommend experimenting with different conversion targets such as `markdown_strict`, `asciidoc`, and `gfm`.
+
+- Intervene in Training
+  - Can I use the generated JSON files for prompt tuning (watsonx.ai) or with Hugging Face directly?
+  - The output of SDG is in JSON format and can also be used for traditional fine-tuning.
+
+- Quantities
+  - The number of seed examples
+  - How many seeds should I provide?
+  - The number of seeds:
+    - The InstructLab product team recommends generating ~300 QnA pairs from ~5 seed examples.
+    - Knowledge requires 5 pieces of context from the document, each with 3 QnA pairs specific to that context piece, for a total of 15 QnA pairs.
+    - We tried with fewer than 300 QnA pairs but found the QnA quality to be only satisfactory.
+    - The task description should be grounded in the domain/document.
+    - Given this recommendation, keep in mind that more complex cases can be split into smaller chunks of information.
+
+- What is the size limit of the context window in the Q&A file (`qna.yaml`)?
+  - Context size limitation:
+  - There is a context size limit of ~2300 in the `qna.yaml` file.
+  - It is advised to keep the ground truth answers concise to respect this limit.
+
+- After Training
+  - How do I check the quality of the data in a large data set for the `qna.yaml` file?
+  - You don’t have to check all of the synthetic data generated by the SDG process. After generating synthetic data internally, the IBM Research team samples it to check quality (there is no need to check every item, especially for an extensive set).
+
+- Quality
+  - How to measure the quality of the obtained data
+    - To evaluate SDG output, you can use the following rating scale (1-5):
+      1. Irrelevant answer.
+      2. Relevant but not close to the ground truth; the model might be hallucinating.
+      3. Relevant, model not hallucinating, partly matching the ground truth.
+      4. Relevant, model not hallucinating, but the model adds irrelevant or unnecessary information.
+      5. Excellent answer; matches the ground truth closely.
+
+  - Keep in mind:
+    - During manual validation, the reviewer identified the entity and intent of the question and searched for the same entity and intent in the corresponding document. The document information was provided in the generated JSON file.
+    - In the next step, a manual search validated that the steps or definitions contained in the answer were indeed in the corresponding document.
+
+  - How to enhance the quality of data generated in SDG
+    - Task description: Add a task description relevant to the knowledge documents. We tried adding a custom task description to improve the SDG.
+    - Prompt template: Add guidelines for instruction and output to stick to document-related keywords and generate instructions from tables. We specifically added these instructions to the prompt template.
+    - Chunk word count: Increase the word count to enlarge the chunks taken from the documents in SDG when the Q&A pairs have long answers.
+    - ROUGE threshold: To enforce data quality more strictly, increase the ROUGE threshold in the `ilab generate` command.
+    - The question and answer pairs should be complete sentences, well-formed, and use proper grammar. Longer answers are better than a short yes or no.
+    - Also, each question must be answerable from its associated context.
+
+- Formatting
+  - How many leaf nodes are kept in the taxonomy after adding a Q&A file?
+  - The documents are kept in a single leaf node, which has one `qna.yaml` file and one `attribution.txt`.