diff --git a/README.md b/README.md index f27c43a24..74770e1eb 100644 --- a/README.md +++ b/README.md @@ -3,8 +3,8 @@ # OpenSSA: Neurosymbolic Agentic AI for Industrial Problem-Solving OpenSSA is an open-source neurosymbolic agentic AI framework -designed to solve complex, high-stakes problems in industries like semiconductor, manufacturing and finance, -where consistency, accuracy and deterministic outcomes are essential. +designed to solve complex, high-stakes problems in industries like semiconductor, energy and finance, +where consistency, accuracy and deterministic outcomes are paramount. At the core of OpenSSA is the [__Domain-Aware Neurosymbolic Agent (DANA)__](https://arxiv.org/abs/2410.02823) architecture, advancing generative AI from basic pattern matching and information retrieval to industrial-grade problem solving. diff --git a/docs/GETTING_STARTED.md b/docs/GETTING_STARTED.md index 79277276f..96e5025de 100644 --- a/docs/GETTING_STARTED.md +++ b/docs/GETTING_STARTED.md @@ -16,10 +16,10 @@ Go straight to [OpenSSA Streamlit app](https://openssa.streamlit.app/) and start ## Getting Started as a Developer -See some example user programs in the [examples/notebooks](./examples/notebooks) directory. For example, to see the sample use case on ALD semiconductor knowledge, do: +See some example user programs in the [examples](./examples) directory. 
For example, to see the sample use case on semiconductor knowledge, do: ```bash -% cd examples/notebooks +% cd examples/semiconductor ``` ### Common `make` targets for OpenSSA developers diff --git a/docs/diagrams/ssm-QA-vs-PS.drawio.png b/docs/diagrams/ssm-QA-vs-PS.drawio.png deleted file mode 100644 index b7258c66d..000000000 Binary files a/docs/diagrams/ssm-QA-vs-PS.drawio.png and /dev/null differ diff --git a/docs/diagrams/ssm-class-diagram.drawio.png b/docs/diagrams/ssm-class-diagram.drawio.png deleted file mode 100644 index 6825e32c1..000000000 Binary files a/docs/diagrams/ssm-class-diagram.drawio.png and /dev/null differ diff --git a/docs/diagrams/ssm-composability.drawio.png b/docs/diagrams/ssm-composability.drawio.png deleted file mode 100644 index b72645565..000000000 Binary files a/docs/diagrams/ssm-composability.drawio.png and /dev/null differ diff --git a/docs/diagrams/ssm-full-industrial-use-case.drawio.png b/docs/diagrams/ssm-full-industrial-use-case.drawio.png deleted file mode 100644 index 7d7a14e21..000000000 Binary files a/docs/diagrams/ssm-full-industrial-use-case.drawio.png and /dev/null differ diff --git a/docs/diagrams/ssm-industrial-use-case.drawio.png b/docs/diagrams/ssm-industrial-use-case.drawio.png deleted file mode 100644 index 343182cb4..000000000 Binary files a/docs/diagrams/ssm-industrial-use-case.drawio.png and /dev/null differ diff --git a/docs/diagrams/ssm-key-components.drawio.png b/docs/diagrams/ssm-key-components.drawio.png deleted file mode 100644 index 13770ee7d..000000000 Binary files a/docs/diagrams/ssm-key-components.drawio.png and /dev/null differ diff --git a/docs/diagrams/ssm-llama-index-integration-patterns.drawio.png b/docs/diagrams/ssm-llama-index-integration-patterns.drawio.png deleted file mode 100644 index 00a93dfb0..000000000 Binary files a/docs/diagrams/ssm-llama-index-integration-patterns.drawio.png and /dev/null differ diff --git a/docs/diagrams/ssm-llama-index-integration.drawio.png 
b/docs/diagrams/ssm-llama-index-integration.drawio.png deleted file mode 100644 index 557106f46..000000000 Binary files a/docs/diagrams/ssm-llama-index-integration.drawio.png and /dev/null differ diff --git a/docs/diagrams/ssm-team-of-experts.drawio.png b/docs/diagrams/ssm-team-of-experts.drawio.png deleted file mode 100644 index dc21e7437..000000000 Binary files a/docs/diagrams/ssm-team-of-experts.drawio.png and /dev/null differ diff --git a/examples/FinanceBench-Lite/.env.template b/examples/FinanceBench-Lite/.env.template new file mode 100644 index 000000000..9c9789785 --- /dev/null +++ b/examples/FinanceBench-Lite/.env.template @@ -0,0 +1,2 @@ +HF_API_KEY=[... HuggingFace API key if running HuggingFace-hosted models ...] +OPENAI_API_KEY=[... OpenAI API key if running on OpenAI services ...] diff --git a/examples/FinanceBench-Lite/.gitignore b/examples/FinanceBench-Lite/.gitignore new file mode 100644 index 000000000..1b80d89fc --- /dev/null +++ b/examples/FinanceBench-Lite/.gitignore @@ -0,0 +1,15 @@ +# data files +.data/ + +# environment variables +.env + +# iPython/Jupyter notebooks +*.ipynb + +# log files +.log/ +*.log + +# Streamlit secrets +.streamlit/secrets.toml diff --git a/examples/FinanceBench-Lite/Makefile b/examples/FinanceBench-Lite/Makefile new file mode 100644 index 000000000..dc5045571 --- /dev/null +++ b/examples/FinanceBench-Lite/Makefile @@ -0,0 +1,33 @@ +dana-solve: + @poetry run python dana.py ${id} + +dana-solve-w-knowledge: + @poetry run python dana.py ${id} --knowledge + +dana-solve-w-prog-store: + @poetry run python dana.py ${id} --prog-store + +dana-solve-w-knowledge-and-prog-store: + @poetry run python dana.py ${id} --knowledge --prog-store + +dana-solve-w-llama3: + @poetry run python dana.py ${id} --llama3 + +dana-solve-w-knowledge-w-llama3: + @poetry run python dana.py ${id} --knowledge --llama3 + +dana-solve-w-prog-store-w-llama3: + @poetry run python dana.py ${id} --prog-store --llama3 + 
+dana-solve-w-knowledge-and-prog-store-w-llama3: + @poetry run python dana.py ${id} --knowledge --prog-store --llama3 + +dana-solve-all-combos: + @poetry run python dana.py ${id} + @poetry run python dana.py ${id} --knowledge + @poetry run python dana.py ${id} --prog-store + @poetry run python dana.py ${id} --knowledge --prog-store + @poetry run python dana.py ${id} --llama3 + @poetry run python dana.py ${id} --knowledge --llama3 + @poetry run python dana.py ${id} --prog-store --llama3 + @poetry run python dana.py ${id} --knowledge --prog-store --llama3 diff --git a/examples/FinanceBench-Lite/README.md b/examples/FinanceBench-Lite/README.md new file mode 100644 index 000000000..6b27245db --- /dev/null +++ b/examples/FinanceBench-Lite/README.md @@ -0,0 +1,58 @@ + + +# OpenSSA-FinanceBench Lite benchmarking + +This is a lite version of the benchmarking of `OpenSSA` performance +on the `FinanceBench` dataset. It uses one question from the dataset to demonstrate `OpenSSA` with the `DANA` architecture. + +## [`FinanceBench` Dataset](https://github.com/patronus-ai/financebench/blob/main/financebench_sample_150.csv) + +## Getting Started with DANA Agent + +Have Python 3.12 installed. + +__Install__ the project, and update its dependencies from time to time: +__`make install`__. + +Create a `.env` file following `.env.template` and fill in the necessary credentials. + +__Solve__ the problem with `financebench_id` `00807`: +__`make dana-solve id=00807`__. + +### Question + +`Does 3M have a reasonably healthy liquidity profile based on its quick ratio for Q2 of FY2023? If the quick ratio is not relevant to measure liquidity, please state that and explain why.` + +### Knowledge + +To solve this question, you can add knowledge related to `liquidity`. 
See the example below: + +- Liquidity Metric Formulas + - `(Net) Working Capital` = `(Total) Current Assets` - `(Total) Current Liabilities` + - `Working Capital Ratio` = `(Total) Current Assets` / `(Total) Current Liabilities` + +Go to `knowledge-store.txt` to add relevant knowledge yourself and see how it helps the agent to solve this question. + +### Program + +With the above-provided knowledge, the program we can provide to the agent could be as below: + +- Goal: To assess liquidity health of a company, calculate `quick ratio` + - Task: To calculate `quick ratio`, use this formula + `Quick Ratio` = ( + (`Cash & Cash Equivalents` + + `Short-Term Investments or (Current) Marketable Securities` + + `(Net) Accounts Receivable, a.k.a. (Net) (Trade) Receivables`) + / `(Total) Current Liabilities` + ) + - Sub-task 1: What are values in dollars of `Cash & Cash Equivalents`? + - Sub-task 2: What are values in dollars of `Short-Term Investments or (Current) Marketable Securities`? + - Sub-task 3: What are values in dollars of `(Net) Accounts Receivable, a.k.a. (Net) (Trade) Receivables`? + - Sub-task 4: What are values in dollars of `(Total) Current Liabilities`? + +Go to `program-store.yml` to see details of the program yourself! You can experiment with different plans to see how they help the agent solve the problem as well. 
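The quick ratio formula in the program above can be sketched in plain Python as a sanity check on the arithmetic the agent is expected to perform. This is a minimal illustration only: `LiquidityInputs` and `quick_ratio` are hypothetical names that are not part of `OpenSSA`, and the figures are made-up round numbers rather than values from any company's filing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LiquidityInputs:
    """Hypothetical container for the four line items, all in the same currency unit."""
    cash_and_equivalents: float
    short_term_investments: float
    net_accounts_receivable: float
    total_current_liabilities: float


def quick_ratio(x: LiquidityInputs) -> float:
    """Quick Ratio = (cash + short-term investments + net receivables) / current liabilities."""
    if x.total_current_liabilities <= 0:
        raise ValueError('total current liabilities must be positive')

    quick_assets = (x.cash_and_equivalents
                    + x.short_term_investments
                    + x.net_accounts_receivable)
    return quick_assets / x.total_current_liabilities


# Illustrative, made-up inputs (NOT taken from any real filing).
example = LiquidityInputs(cash_and_equivalents=400.0,
                          short_term_investments=100.0,
                          net_accounts_receivable=500.0,
                          total_current_liabilities=800.0)

print(f'Quick Ratio = {quick_ratio(example):.2f}')  # (400 + 100 + 500) / 800 = 1.25
```

Each of the four sub-tasks supplies one field of the input, so a wrong or missing line item shows up directly as a wrong ratio.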
+ +## Advancing DANA Agent with Domain Knowledge and Program Store + +- To solve the question with added domain knowledge, run `make dana-solve-w-knowledge id=00807` +- To solve the question with added domain knowledge and program store, run `make dana-solve-w-knowledge-and-prog-store id=00807` diff --git a/examples/FinanceBench-Lite/dana.py b/examples/FinanceBench-Lite/dana.py new file mode 100644 index 000000000..92ec4ee61 --- /dev/null +++ b/examples/FinanceBench-Lite/dana.py @@ -0,0 +1,155 @@ +from argparse import ArgumentParser +from functools import cache + +from openssa import DANA, ProgramStore, HTP, HTPlanner, FileResource, LMConfig +from openssa.core.util.lm.huggingface import HuggingFaceLM +from openssa.core.util.lm.openai import OpenAILM, default_llama_index_openai_lm + +# pylint: disable=wrong-import-order,wrong-import-position +from data_and_knowledge import (DocName, FbId, Answer, Doc, FB_ID_COL_NAME, DOC_NAMES_BY_FB_ID, QS_BY_FB_ID, + EXPERT_KNOWLEDGE, EXPERT_PROGRAMS, EXPERT_HTP_COMPANY_KEY, EXPERT_HTP_PERIOD_KEY) +from util import QAFunc, enable_batch_qa_and_eval, log_qa_and_update_output_file + + +@cache +def get_main_lm(use_llama3: bool = False): + return (HuggingFaceLM if use_llama3 else OpenAILM).from_defaults() + + +@cache +def get_or_create_expert_program_store(use_llama3: bool = False) -> ProgramStore: + program_store = ProgramStore(lm=get_main_lm(use_llama3=use_llama3)) + + for program_name, htp_dict in EXPERT_PROGRAMS.items(): + htp = HTP.from_dict(htp_dict) + program_store.add_or_update_program(name=program_name, description=htp.task.ask, program=htp) + + return program_store + + +@cache +def get_or_create_agent(doc_name: DocName, expert_knowledge: bool = False, expert_programs: bool = False, + max_depth=3, max_subtasks_per_decomp=6, + use_llama3: bool = False, + llama_index_openai_lm_name: str = LMConfig.OPENAI_DEFAULT_MODEL) -> DANA: + # pylint: disable=too-many-arguments + return DANA(knowledge={EXPERT_KNOWLEDGE} if expert_knowledge 
else None, + + program_store=(get_or_create_expert_program_store(use_llama3=use_llama3) + if expert_programs + else ProgramStore()), + + programmer=HTPlanner(lm=get_main_lm(use_llama3=use_llama3), + max_depth=max_depth, max_subtasks_per_decomp=max_subtasks_per_decomp), + + resources={FileResource(path=Doc(name=doc_name).dir_path, + lm=default_llama_index_openai_lm(llama_index_openai_lm_name))}) + + +@cache +def get_or_create_adaptations(doc_name: DocName) -> dict[str, str]: + return {EXPERT_HTP_COMPANY_KEY: (doc := Doc(name=doc_name)).company, EXPERT_HTP_PERIOD_KEY: doc.period} + + +@enable_batch_qa_and_eval(output_name='DANA') +@log_qa_and_update_output_file(output_name='DANA') +def solve(fb_id: FbId) -> Answer: + return get_or_create_agent(doc_name=DOC_NAMES_BY_FB_ID[fb_id]).solve( + problem=QS_BY_FB_ID[fb_id], + adaptations_from_known_programs=get_or_create_adaptations(doc_name=DOC_NAMES_BY_FB_ID[fb_id])) + + +@enable_batch_qa_and_eval(output_name='DANA-wKnowledge') +@log_qa_and_update_output_file(output_name='DANA-wKnowledge') +def solve_with_knowledge(fb_id: FbId) -> Answer: + return get_or_create_agent(doc_name=DOC_NAMES_BY_FB_ID[fb_id], expert_knowledge=True).solve( + problem=QS_BY_FB_ID[fb_id], + adaptations_from_known_programs=get_or_create_adaptations(doc_name=DOC_NAMES_BY_FB_ID[fb_id])) + + +@enable_batch_qa_and_eval(output_name='DANA-wProgStore') +@log_qa_and_update_output_file(output_name='DANA-wProgStore') +def solve_with_program_store(fb_id: FbId) -> Answer: + return get_or_create_agent(doc_name=DOC_NAMES_BY_FB_ID[fb_id], expert_programs=True).solve( + problem=QS_BY_FB_ID[fb_id], + adaptations_from_known_programs=get_or_create_adaptations(doc_name=DOC_NAMES_BY_FB_ID[fb_id])) + + +@enable_batch_qa_and_eval(output_name='DANA-wKnowledge-wProgStore') +@log_qa_and_update_output_file(output_name='DANA-wKnowledge-wProgStore') +def solve_with_knowledge_and_program_store(fb_id: FbId) -> Answer: + return get_or_create_agent(DOC_NAMES_BY_FB_ID[fb_id], 
expert_knowledge=True, expert_programs=True).solve( + problem=QS_BY_FB_ID[fb_id], + adaptations_from_known_programs=get_or_create_adaptations(doc_name=DOC_NAMES_BY_FB_ID[fb_id])) + + +@enable_batch_qa_and_eval(output_name='DANA-wLlama3') +@log_qa_and_update_output_file(output_name='DANA-wLlama3') +def solve_with_llama3(fb_id: FbId) -> Answer: + return get_or_create_agent(doc_name=DOC_NAMES_BY_FB_ID[fb_id], use_llama3=True).solve( + problem=QS_BY_FB_ID[fb_id], + adaptations_from_known_programs=get_or_create_adaptations(doc_name=DOC_NAMES_BY_FB_ID[fb_id])) + + +@enable_batch_qa_and_eval(output_name='DANA-wKnowledge-wLlama3') +@log_qa_and_update_output_file(output_name='DANA-wKnowledge-wLlama3') +def solve_with_knowledge_with_llama3(fb_id: FbId) -> Answer: + return get_or_create_agent(doc_name=DOC_NAMES_BY_FB_ID[fb_id], expert_knowledge=True, use_llama3=True).solve( + problem=QS_BY_FB_ID[fb_id], + adaptations_from_known_programs=get_or_create_adaptations(doc_name=DOC_NAMES_BY_FB_ID[fb_id])) + + +@enable_batch_qa_and_eval(output_name='DANA-wProgStore-wLlama3') +@log_qa_and_update_output_file(output_name='DANA-wProgStore-wLlama3') +def solve_with_program_store_with_llama3(fb_id: FbId) -> Answer: + return get_or_create_agent(doc_name=DOC_NAMES_BY_FB_ID[fb_id], expert_programs=True, use_llama3=True).solve( + problem=QS_BY_FB_ID[fb_id], + adaptations_from_known_programs=get_or_create_adaptations(doc_name=DOC_NAMES_BY_FB_ID[fb_id])) + + +@enable_batch_qa_and_eval(output_name='DANA-wKnowledge-wProgStore-wLlama3') +@log_qa_and_update_output_file(output_name='DANA-wKnowledge-wProgStore-wLlama3') +def solve_with_knowledge_and_program_store_with_llama3(fb_id: FbId) -> Answer: + return get_or_create_agent(DOC_NAMES_BY_FB_ID[fb_id], expert_knowledge=True, expert_programs=True, use_llama3=True).solve( # noqa: E501 + problem=QS_BY_FB_ID[fb_id], + adaptations_from_known_programs=get_or_create_adaptations(doc_name=DOC_NAMES_BY_FB_ID[fb_id])) + + +if __name__ == '__main__': + 
arg_parser = ArgumentParser() + arg_parser.add_argument('fb_id') + arg_parser.add_argument('--from-id', action='store_true') + arg_parser.add_argument('--knowledge', action='store_true') + arg_parser.add_argument('--prog-store', action='store_true') + arg_parser.add_argument('--llama3', action='store_true') + args = arg_parser.parse_args() + + match (args.knowledge, args.prog_store, args.llama3): + case (False, False, False): + solve_func: QAFunc = solve + + case (True, False, False): + solve_func: QAFunc = solve_with_knowledge + + case (False, True, False): + solve_func: QAFunc = solve_with_program_store + + case (True, True, False): + solve_func: QAFunc = solve_with_knowledge_and_program_store + + case (False, False, True): + solve_func: QAFunc = solve_with_llama3 + + case (True, False, True): + solve_func: QAFunc = solve_with_knowledge_with_llama3 + + case (False, True, True): + solve_func: QAFunc = solve_with_program_store_with_llama3 + + case (True, True, True): + solve_func: QAFunc = solve_with_knowledge_and_program_store_with_llama3 + + if not (fb_id := args.fb_id).startswith(FB_ID_COL_NAME): + fb_id: FbId = f'{FB_ID_COL_NAME}_{fb_id}' + + solve_func(f'from:{fb_id}' if args.from_id else fb_id) diff --git a/examples/FinanceBench-Lite/data_and_knowledge.py b/examples/FinanceBench-Lite/data_and_knowledge.py new file mode 100644 index 000000000..7dbf1e41e --- /dev/null +++ b/examples/FinanceBench-Lite/data_and_knowledge.py @@ -0,0 +1,332 @@ +from __future__ import annotations + +from collections import Counter +from dataclasses import dataclass, field +import base64 +from enum import StrEnum +from functools import cached_property +from pathlib import Path +from typing import TypedDict, Required, NotRequired, Literal, TYPE_CHECKING + +from dotenv import load_dotenv +from pandas import DataFrame, read_json, read_csv +import requests +import yaml + +if TYPE_CHECKING: + from openssa.core.planning.hierarchical.plan import HTPDict + + +load_dotenv() + + +type DocName 
= str +type FbId = str +type Question = str +type Answer = str +type ExpertPlanId = str + + +class Category(StrEnum): + RETRIEVE: str = '0-RETRIEVE' + COMPARE: str = '1-COMPARE' + CALC_CHANGE: str = '2-CALC-CHANGE' + CALC_COMPLEX: str = '3-CALC-COMPLEX' + CALC_AND_JUDGE: str = '4-CALC-AND-JUDGE' + EXPLAIN_FACTORS: str = '5-EXPLAIN-FACTORS' + OTHER_ADVANCED: str = '6-OTHER-ADVANCED' + + +type GroundTruth = TypedDict('GroundTruth', {'sector': Required[str], + + 'company': Required[str], + 'period': Required[int], + 'doc-type': Required[str], + 'doc': Required[DocName], + + 'question-type': Required[str], + 'question-reasoning': Required[str], + 'domain-question-num': Required[str | None], + 'question': Required[Question], + + 'answer': Required[Answer], + 'justification': Required[str], + 'page(s)-0based': Required[int], + 'page(s)': Required[str], + + 'category': Required[Category], + 'correctness': Required[str], + 'answer-inadequate': NotRequired[Literal[True]], + 'evaluator-unreliable': NotRequired[Literal[True]]}, + total=False) + + +type RAGGroundTruths = TypedDict('RAGGroundTruths', {'defs': Required[dict[str, str]], + 'ground-truths': Required[dict[str, # doc + dict[str, # statement + dict[str, # line item + dict[int | str, # period + str # ground truth + ]]]]]}) + + +NON_BOT_REQUEST_HEADERS: dict[str, str] = { + "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3" +} + + +REPO_RAW_CONTENT_URL_PREFIX: str = 'https://raw.githubusercontent.com/patronus-ai/financebench' +DOC_INFO_URL: str = f'{REPO_RAW_CONTENT_URL_PREFIX}/main/data/financebench_document_information.jsonl' +METADATA_JSONL_URL: str = f'{REPO_RAW_CONTENT_URL_PREFIX}/main/data/financebench_open_source.jsonl' +METADATA_CSV_URL: str = f'{REPO_RAW_CONTENT_URL_PREFIX}/641ae9ece2cae93c671cf59c2d53742b51c7f1aa/financebench_sample_150.csv' + +FB_ID_COL_NAME: str = 'financebench_id' + +META_DF: DataFrame = 
(read_json(METADATA_JSONL_URL, + orient='records', typ='frame', + dtype=True, convert_axes=True, + convert_dates=True, keep_default_dates=True, + precise_float=False, date_unit=None, + encoding='utf-8', encoding_errors='strict', + lines=True, chunksize=None, + compression=None, nrows=None, + storage_options=None, + dtype_backend='pyarrow', engine='ujson') + + .merge(right=read_json( + DOC_INFO_URL, + orient='records', typ='frame', + dtype=True, convert_axes=True, + convert_dates=True, keep_default_dates=True, + precise_float=False, date_unit=None, + encoding='utf-8', encoding_errors='strict', + lines=True, chunksize=None, + compression=None, nrows=None, + storage_options=None, + dtype_backend='pyarrow', engine='ujson'), + + how='left', on='doc_name', # left_on='doc_name', right_on='doc_name', + left_index=False, right_index=False, + sort=False, + suffixes=('', '_'), + copy=False, + indicator=False, + validate=None # TODO: 'many_to_one' after Patronus AI fixes FOOTLOCKER_2022_annualreport + ) + + .set_index(keys=FB_ID_COL_NAME, + drop=True, append=False, + inplace=False, + verify_integrity=True)) + +META_DF.fillna(value='', method=None, axis=None, inplace=True, limit=None) # replace PyArrow NAs + +LEGACY_META_DF: DataFrame = read_csv(METADATA_CSV_URL, + sep=',', # delimiter=',', + header='infer', names=None, index_col=FB_ID_COL_NAME, usecols=None, + dtype=None, engine='pyarrow', converters=None, true_values=None, false_values=None, + skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, + na_values=None, na_filter=None, keep_default_na=True, + skip_blank_lines=True, + parse_dates=False, date_format=None, dayfirst=False, cache_dates=True, + iterator=False, chunksize=None, compression=None, + thousands=None, decimal='.', + lineterminator=None, + quotechar=None, quoting=0, doublequote=True, + escapechar=None, comment=None, + encoding='utf-8', encoding_errors='strict', + dialect=None, + on_bad_lines='error', + low_memory=True, memory_map=False, + 
float_precision=None, + storage_options=None, + dtype_backend='pyarrow') + +assert (META_DF.index == LEGACY_META_DF.index).all() +# assert (META_DF.doc_name == LEGACY_META_DF.doc_name).all() # J&J docs have been fixed +assert (META_DF.doc_period == LEGACY_META_DF.doc_period).all() +assert (META_DF.doc_link == LEGACY_META_DF.doc_link).all() +assert (META_DF.question_type == LEGACY_META_DF.question_type).all() +assert (META_DF.question == LEGACY_META_DF.question).all() +# assert (META_DF.answer == LEGACY_META_DF.answer).all() # 01107 answer has been fixed + +DOC_NAMES: list[DocName] = sorted(META_DF.doc_name.unique()) +DOC_LINKS_BY_NAME: dict[DocName, str] = dict(zip(META_DF.doc_name, META_DF.doc_link)) +DOC_NAMES_BY_FB_ID: dict[FbId, DocName] = META_DF.doc_name.to_dict() + +FB_IDS: list[FbId] = META_DF.index.to_list() +FB_IDS_BY_DOC_NAME: dict[DocName, list[FbId]] = META_DF.groupby('doc_name').apply(lambda _: _.index.to_list()) + +QS_BY_FB_ID: dict[FbId, Question] = META_DF.question.to_dict() + + +LOCAL_CACHE_DIR_PATH: Path = Path(__file__).parent / '.data' +LOCAL_CACHE_DOCS_DIR_PATH: Path = LOCAL_CACHE_DIR_PATH / 'docs' +OUTPUT_FILE_PATH: Path = LOCAL_CACHE_DIR_PATH / 'output.csv' + + +GROUND_TRUTHS_FILE_PATH = Path(__file__).parent / 'ground-truths.yml' +with open(file=GROUND_TRUTHS_FILE_PATH, + buffering=-1, + encoding='utf-8', + errors='strict', + newline=None, + closefd=True, + opener=None) as f: + GROUND_TRUTHS: dict[FbId, GroundTruth] = yaml.safe_load(stream=f) + +N_CASES: int = len(GROUND_TRUTHS) +CAT_DISTRIB: Counter[Category] = Counter(ground_truth['category'] for ground_truth in GROUND_TRUTHS.values()) + + +EXPERT_KNOWLEDGE_FILE_PATH: Path = Path(__file__).parent / 'knowledge-store.txt' +with open(file=EXPERT_KNOWLEDGE_FILE_PATH, + buffering=-1, + encoding='utf-8', + errors='strict', + newline=None, + closefd=True, + opener=None) as f: + EXPERT_KNOWLEDGE: str = f.read() + + +EXPERT_PROGRAMS_FILE_PATH: Path = Path(__file__).parent / 'program-store.yml' 
+with open(file=EXPERT_PROGRAMS_FILE_PATH, + buffering=-1, + encoding='utf-8', + errors='strict', + newline=None, + closefd=True, + opener=None) as f: + EXPERT_PROGRAMS: dict[ExpertPlanId, HTPDict] = yaml.safe_load(stream=f) + +EXPERT_HTP_COMPANY_KEY: str = 'COMPANY' +EXPERT_HTP_PERIOD_KEY: str = 'PERIOD' + + +RAG_GROUND_TRUTHS_FILE_PATH: Path = Path(__file__).parent / 'rag-ground-truths.yml' +with open(file=RAG_GROUND_TRUTHS_FILE_PATH, + buffering=-1, + encoding='utf-8', + errors='strict', + newline=None, + closefd=True, + opener=None) as f: + RAG_GROUND_TRUTHS: RAGGroundTruths = yaml.safe_load(stream=f) + + +@dataclass +class Doc: + name: DocName + company: str = field(init=False, repr=False) + period: str = field(init=False, repr=False) + type: str = field(init=False, repr=False) + + def __post_init__(self): + self.company, self.period, self.type = self.name.split(sep='_', maxsplit=2) + + def request(self) -> requests.Response: + try: + response: requests.Response = requests.get( + url=(url := ((base64.b64decode(doc_link.split(sep=q, maxsplit=-1)[-1], altchars=None) + .decode(encoding='utf-8', errors='strict')) + if (q := '?pdfTarget=') in (doc_link := DOC_LINKS_BY_NAME[self.name]) + else doc_link)), + timeout=60, + stream=True) + + except requests.exceptions.ConnectionError: + response: requests.Response = requests.get( + url=(url := f'{REPO_RAW_CONTENT_URL_PREFIX}/main/pdfs/{self.name}.pdf'), + timeout=60, + stream=True) + + if response.headers.get('Content-Type') != 'application/pdf': + response: requests.Response = requests.get(url=url, + headers=NON_BOT_REQUEST_HEADERS, + timeout=60, + stream=True) + + return response + + @cached_property + def dir_path(self) -> Path: + dir_path: Path = LOCAL_CACHE_DOCS_DIR_PATH / self.name + + if not (file_path := dir_path / f'{self.name}.pdf').is_file(): + dir_path.mkdir(parents=True, exist_ok=True) + + response: requests.Response = self.request() + + with open(file=file_path, mode='wb', buffering=-1, encoding=None, 
newline=None, closefd=True, opener=None) as f: + f.write(response.content) + + return dir_path + + @cached_property + def file_path(self) -> Path: + return self.dir_path / f'{self.name}.pdf' + + +def create_or_update_ground_truths() -> dict[FbId, GroundTruth]: + ground_truths: dict[FbId, GroundTruth] = {fb_id: {'sector': row.gics_sector, + 'company': row.company, 'period': row.doc_period, 'doc-type': row.doc_type, + 'doc': row.doc_name, + 'question-type': row.question_type, + 'question-reasoning': row.question_reasoning, + 'domain-question-num': row.domain_question_num, + 'question': row.question, + 'answer': row.answer, 'justification': row.justification, + 'page(s)-0based': row.evidence[0]['evidence_page_num']} + for fb_id, row in META_DF.iterrows()} + + if GROUND_TRUTHS_FILE_PATH.is_file(): + with open(file=GROUND_TRUTHS_FILE_PATH, + buffering=-1, + encoding='utf-8', + errors='strict', + newline=None, + closefd=True, + opener=None) as f: + existing_ground_truths: dict[FbId, GroundTruth] = yaml.safe_load(stream=f) + + for fb_id, ground_truth in ground_truths.items(): + if (existing_ground_truth := existing_ground_truths.get(fb_id)): + for existing_key in set(existing_ground_truth).difference(ground_truth): + ground_truth[existing_key] = existing_ground_truth[existing_key] + + with open(file=GROUND_TRUTHS_FILE_PATH, + mode='w', + buffering=-1, + encoding='utf-8', + errors='strict', + newline=None, + closefd=True, + opener=None) as f: + yaml.safe_dump(data=ground_truths, + stream=f, + default_style=None, + default_flow_style=False, + canonical=None, + indent=2, + width=80, + allow_unicode=True, + line_break=None, + encoding='utf-8', + explicit_start=None, + explicit_end=None, + version=None, + tags=None, + sort_keys=False) + + return ground_truths + + +def get_or_create_output_df() -> DataFrame: + output_df: DataFrame = (read_csv(OUTPUT_FILE_PATH, index_col=FB_ID_COL_NAME) + if OUTPUT_FILE_PATH.is_file() + else META_DF[['doc_name', 'question', 'answer']]) + + 
output_df.loc[:, 'category'] = [GROUND_TRUTHS[fb_id]['category'] for fb_id in output_df.index] + + return output_df diff --git a/examples/FinanceBench-Lite/eval.py b/examples/FinanceBench-Lite/eval.py new file mode 100644 index 000000000..77f491f4f --- /dev/null +++ b/examples/FinanceBench-Lite/eval.py @@ -0,0 +1,301 @@ +from __future__ import annotations + +import argparse +from collections import defaultdict +from functools import cache +from pprint import pprint +from typing import TYPE_CHECKING + +from dotenv import load_dotenv +from loguru import logger +from pandas import DataFrame, notna, read_csv +from tqdm import tqdm + +from openssa.core.util.lm.config import LMConfig +from openssa.core.util.lm.openai import OpenAILM + +# pylint: disable=wrong-import-order +from data_and_knowledge import (FbId, Question, Answer, Category, GroundTruth, + FB_ID_COL_NAME, GROUND_TRUTHS, N_CASES, CAT_DISTRIB, + LOCAL_CACHE_DIR_PATH, OUTPUT_FILE_PATH, get_or_create_output_df) +from log import switch_log_file + +if TYPE_CHECKING: + from openssa.core.util.lm.abstract import AbstractLM + + +EVAL_PROMPT_TEMPLATE: str = \ +"""You shall act as a judge of question-answering correctness. + +Given the posed QUESTION below, evaluate whether the ANSWER below is correct +according to the criteria specified in the CORRECTNESS EVALUATION RUBRIC below. + +- The evaluation should regard the ANSWER as responding to the QUESTION, + and hence the ANSWER does not need to repeat contextual information already in the QUESTION; + +- The evaluation should follow the RUBRIC strictly, + not looking for in the ANSWER more elaboration/explanation than what the RUBRIC explicitly requires; + +- Financial and technical terminology can be treated as case-insensitive. + +Output only a single word, either: +- YES: if you judge the ANSWER to be correct; or +- NO: if you judge the ANSWER to be incorrect. 
+ +QUESTION: +--------- +``` +{question} +``` + +ANSWER TO EVALUATE: +------------------- +``` +{answer} +``` + +CORRECTNESS EVALUATION RUBRIC: +------------------------------ +``` +{rubric} +``` +""" # noqa: E122 + + +load_dotenv() + + +@cache +def get_lm(model='gpt-4o') -> AbstractLM: + return OpenAILM(model=model, api_key=LMConfig.OPENAI_API_KEY, api_base=LMConfig.OPENAI_API_URL) + + +def human_eval_recommended(fb_id: FbId) -> bool | None: + return (ground_truth := GROUND_TRUTHS[fb_id]).get('answer-inadequate') or ground_truth.get('evaluator-unreliable') + + +def eval_correctness(fb_id: FbId, answer: Answer, output_name: str | None = None, # pylint: disable=too-many-arguments + n_times: int = 9, human: bool = True, debug: bool = False) -> bool: + if output_name: + switch_log_file(fb_id=fb_id, output_name=output_name) + + question: Question = (ground_truth := GROUND_TRUTHS[fb_id])['question'] + rubric: str = ground_truth['correctness'] + prompt: str = EVAL_PROMPT_TEMPLATE.format(question=question, answer=answer, rubric=rubric) + + lm: AbstractLM = get_lm() + + for _ in range(n_times): + score: str = '' + + while score not in {'YES', 'NO'}: + score: str = lm.get_response(prompt=prompt, temperature=0) + + if score == 'NO': + logger.warning(f'\n{fb_id}\n{ground_truth['doc']}:\n{question}\n' + '\n' + f'ANSWER JUDGED TO BE INCORRECT:\n{answer}\n' + '\n' + f'RUBRIC:\n{rubric}' + + ('\n\n(*** EXPERT ANSWER KNOWN TO BE INADEQUATE ***)\n' + if GROUND_TRUTHS[fb_id].get('answer-inadequate') + else '\n')) + + if debug: + logger.debug(f'PROMPT:\n{prompt}') + + if human and human_eval_recommended(fb_id=fb_id): + human_eval_str: str = '' + while not human_eval_str: + human_eval_str: str = input('\n*** HUMAN EVALUATION ***: if answer is correct, type "Y": ').strip() + + correct: bool = human_eval_str.lower().startswith('y') + + else: + correct: bool = False + + break + + else: + correct: bool = True + + if output_name: + output_df: DataFrame = get_or_create_output_df() + 
output_df.loc[fb_id, f'{output_name}---CORRECTNESS']: bool = correct + output_df.to_csv(OUTPUT_FILE_PATH, index=True) + + return correct + + +def eval_all(output_name: str, refresh: bool = True, n_times: int = 9, human: bool = True, debug: bool = False): + # pylint: disable=too-many-locals + output_df: DataFrame = get_or_create_output_df() + + n_yes_scores_by_category: defaultdict = defaultdict(int) + incorrect_answer_fb_ids: dict[FbId, str] = {} + + for fb_id, answer in tqdm(output_df[output_name].items(), total=N_CASES): + ground_truth: GroundTruth = GROUND_TRUTHS[fb_id] + + if (eval_correctness(fb_id=fb_id, answer=answer, output_name=output_name, n_times=n_times, human=human, debug=debug) # noqa: E501 + if refresh + else (notna(correctness := output_df.loc[fb_id, f'{output_name}---CORRECTNESS']) and correctness)): + n_yes_scores_by_category[ground_truth['category']] += 1 + + else: + incorrect_answer_fb_ids[fb_id]: str = ('expert answer inadequate' + if ground_truth.get('answer-inadequate') + else ('evaluator unreliable' + if ground_truth.get('evaluator-unreliable') + else '')) + + logger.info(f'TOTAL CORRECT: {(n := sum(n_yes_scores_by_category.values()))} / {N_CASES} = {n / N_CASES:.1%}') + + pprint(correctness_by_category := {category: (f'{(n := n_yes_scores_by_category[category])} / {n_for_category} ' + f'= {n / n_for_category:.1%}') + for category, n_for_category in CAT_DISTRIB.items()}) + + pprint({ + 'EASY': (f'{(e := sum(n_yes_scores_by_category[easy_cat] + for easy_cat in (Category.RETRIEVE, Category.COMPARE, Category.CALC_CHANGE)))} / ' + f'{(se := sum(CAT_DISTRIB[easy_cat] + for easy_cat in (Category.RETRIEVE, Category.COMPARE, Category.CALC_CHANGE)))} ' + f'= {e / se:.1%}'), + + 'HARD': (f'{(h := sum(n_yes_scores_by_category[hard_cat] + for hard_cat in (Category.CALC_COMPLEX, Category.CALC_AND_JUDGE, + Category.EXPLAIN_FACTORS, Category.OTHER_ADVANCED)))} / ' + f'{(sh := sum(CAT_DISTRIB[hard_cat] + for hard_cat in (Category.CALC_COMPLEX, 
Category.CALC_AND_JUDGE, + Category.EXPLAIN_FACTORS, Category.OTHER_ADVANCED)))} ' + f'= {h / sh:.1%}') + }) + + logger.warning('INCORRECT:') + pprint(incorrect_answer_fb_ids) + + return correctness_by_category + + +def compare_eval(output_name: str, baseline_output_name: str = 'RAG-Default'): + output_df: DataFrame = get_or_create_output_df() + + baseline_correctness_by_category: dict[str, str] = eval_all(output_name=baseline_output_name, refresh=False) + correctness_by_category: dict[str, str] = eval_all(output_name=output_name, refresh=False) + pprint({category: {output_name: correctness_summary, baseline_output_name: baseline_correctness_by_category[category]} + for category, correctness_summary in correctness_by_category.items()}) + + output_df.loc[:, baseline_output_name] = output_df[f'{baseline_output_name}---CORRECTNESS'] + output_df.loc[:, output_name] = output_df[f'{output_name}---CORRECTNESS'] + return output_df.loc[output_df[output_name] != output_df[baseline_output_name], + ['doc_name', 'category', baseline_output_name, output_name]] + + +def eval_accuracy_and_consistency_wrt_ground_truths(output_name: str, output_file_names: list[str]): + # pylint: disable=too-many-locals + + n_output_files: int = len(output_file_names) + correctness_col_name: str = f'{output_name}---CORRECTNESS' + + n_yes_scores_by_fb_id: defaultdict = defaultdict(int) + incorrect_answer_fb_ids: dict[FbId, str] = {} + + for output_df in (read_csv(LOCAL_CACHE_DIR_PATH / output_file_name, index_col=FB_ID_COL_NAME) + for output_file_name in output_file_names): + + for fb_id, correctness in output_df[correctness_col_name].items(): + ground_truth: GroundTruth = GROUND_TRUTHS[fb_id] + + if notna(correctness) and correctness: + n_yes_scores_by_fb_id[fb_id] += 1 + + else: + incorrect_answer_fb_ids[fb_id]: str = ('expert answer inadequate' + if ground_truth.get('answer-inadequate') + else ('evaluator unreliable' + if ground_truth.get('evaluator-unreliable') + else '')) + + 
cumu_avg_accuracy_scores_by_category: defaultdict = defaultdict(int) + cumu_consistency_scores_by_category: defaultdict = defaultdict(float) + + for fb_id, ground_truth in GROUND_TRUTHS.items(): + cumu_avg_accuracy_scores_by_category[cat := ground_truth['category']] += (a := n_yes_scores_by_fb_id[fb_id] / n_output_files) + cumu_consistency_scores_by_category[cat] += 2 * abs(a - 0.5) + + print(f'TOTAL CORRECT: {(n := sum(cumu_avg_accuracy_scores_by_category.values()))} / {N_CASES} = {n / N_CASES:.1%}') + + pprint({category: (f'{(n := cumu_avg_accuracy_scores_by_category[category])} / {n_for_category} ' + f'= {n / n_for_category:.1%}') + for category, n_for_category in CAT_DISTRIB.items()}) + + pprint({ + 'EASY': (f'{(e := sum(cumu_avg_accuracy_scores_by_category[easy_cat] + for easy_cat in (Category.RETRIEVE, Category.COMPARE, Category.CALC_CHANGE)))} / ' + f'{(se := sum(CAT_DISTRIB[easy_cat] + for easy_cat in (Category.RETRIEVE, Category.COMPARE, Category.CALC_CHANGE)))} ' + f'= {e / se:.1%}'), + + 'HARD': (f'{(h := sum(cumu_avg_accuracy_scores_by_category[hard_cat] + for hard_cat in (Category.CALC_COMPLEX, Category.CALC_AND_JUDGE, + Category.EXPLAIN_FACTORS, Category.OTHER_ADVANCED)))} / ' + f'{(sh := sum(CAT_DISTRIB[hard_cat] + for hard_cat in (Category.CALC_COMPLEX, Category.CALC_AND_JUDGE, + Category.EXPLAIN_FACTORS, Category.OTHER_ADVANCED)))} ' + f'= {h / sh:.1%}') + }) + + print(f'\nTOTAL CONSISTENT: {(n := sum(cumu_consistency_scores_by_category.values()))} / {N_CASES} = {n / N_CASES:.1%}') + + pprint({category: (f'{(n := cumu_consistency_scores_by_category[category])} / {n_for_category} ' + f'= {n / n_for_category:.1%}') + for category, n_for_category in CAT_DISTRIB.items()}) + + pprint({ + 'EASY': (f'{(e := sum(cumu_consistency_scores_by_category[easy_cat] + for easy_cat in (Category.RETRIEVE, Category.COMPARE, Category.CALC_CHANGE)))} / ' + f'{(se := sum(CAT_DISTRIB[easy_cat] + for easy_cat in (Category.RETRIEVE, Category.COMPARE, 
Category.CALC_CHANGE)))} ' + f'= {e / se:.1%}'), + + 'HARD': (f'{(h := sum(cumu_consistency_scores_by_category[hard_cat] + for hard_cat in (Category.CALC_COMPLEX, Category.CALC_AND_JUDGE, + Category.EXPLAIN_FACTORS, Category.OTHER_ADVANCED)))} / ' + f'{(sh := sum(CAT_DISTRIB[hard_cat] + for hard_cat in (Category.CALC_COMPLEX, Category.CALC_AND_JUDGE, + Category.EXPLAIN_FACTORS, Category.OTHER_ADVANCED)))} ' + f'= {h / sh:.1%}') + }) + + print('\nINCORRECT:') + pprint(incorrect_answer_fb_ids) + + +if __name__ == '__main__': + arg_parser = argparse.ArgumentParser() + + arg_parser.add_argument('answer_col', help='Name of the column containing answers to evaluate') + arg_parser.add_argument('--id', default='all', help='FinanceBench Case ID') + arg_parser.add_argument('--n-times', type=int, default=9, help='Number of times to evaluate') + + arg_parser.add_argument('--human-eval', dest='human_eval', action='store_true', help='Human Evaluation ON') + arg_parser.add_argument('--no-human-eval', dest='human_eval', action='store_false', help='Human Evaluation OFF') + arg_parser.set_defaults(human_eval=True) + + arg_parser.add_argument('--refresh', dest='refresh', action='store_true', help='Evaluation Refreshing ON') + arg_parser.add_argument('--no-refresh', dest='refresh', action='store_false', help='Evaluation Refreshing OFF') + arg_parser.set_defaults(refresh=True) + + arg_parser.add_argument('--debug', action='store_true', help='Debug by printing out prompts') + + args = arg_parser.parse_args() + + if 'all' in args.id.lower(): + eval_all(output_name=args.answer_col, refresh=args.refresh, n_times=args.n_times, human=args.human_eval, debug=args.debug) # noqa: E501 + + else: + logger.info( + eval_correctness(fb_id=args.id, + answer=read_csv(OUTPUT_FILE_PATH, index_col=FB_ID_COL_NAME).loc[args.id, args.answer_col], + output_name=args.answer_col, + n_times=args.n_times, human=args.human_eval, debug=args.debug)) diff --git a/examples/FinanceBench-Lite/ground-truths.yml 
b/examples/FinanceBench-Lite/ground-truths.yml new file mode 100644 index 000000000..7cc0d1fc3 --- /dev/null +++ b/examples/FinanceBench-Lite/ground-truths.yml @@ -0,0 +1,4608 @@ +financebench_id_03029: + sector: Industrials + + company: 3M + period: 2018 + doc-type: 10k + doc: 3M_2018_10K + + question-type: metrics-generated + question-reasoning: Information extraction + domain-question-num: '' + question: What is the FY2018 capital expenditure amount (in USD millions) for 3M? + Give a response to the question by relying on the details shown in the cash flow + statement. + + answer: $1577.00 + justification: 'The metric capital expenditures was directly extracted from the + company 10K. The line item name, as seen in the 10K, was: Purchases of property, + plant and equipment (PP&E).' + page(s)-0based: 59 + page(s): '60' + + category: 0-RETRIEVE + correctness: >- + the answer contains a quantity equivalent to or approximately equal to + 1577, 1577 million, 1.577 billion, + 1600, 1600 million or 1.6 billion + + +financebench_id_04672: + sector: Industrials + + company: 3M + period: 2018 + doc-type: 10k + doc: 3M_2018_10K + + question-type: metrics-generated + question-reasoning: Information extraction + domain-question-num: '' + question: 'Assume that you are a public equities analyst. Answer the following question + by primarily using information that is shown in the balance sheet: what is the + year end FY2018 net PPNE for 3M? Answer in USD billions.' + + answer: $8.70 + justification: "The metric ppne, net was directly extracted from the company 10K.\ + \ The line item name, as seen in the 10K, was: Property, plant and equipment \u2014\ + \ net." 
+ page(s)-0based: 57 + page(s): '58' + + category: 0-RETRIEVE + correctness: >- + the answer contains a quantity equivalent to or approximately equal to + 8.738, 8.738 billion, 8738 million, + 8.7, 8.7 billion or 8700 million + + evaluator-unreliable: true + + +financebench_id_00499: + sector: Industrials + + company: 3M + period: 2022 + doc-type: 10k + doc: 3M_2022_10K + + question-type: domain-relevant + question-reasoning: Logical reasoning (based on numerical reasoning) + domain-question-num: dg06 + question: Is 3M a capital-intensive business based on FY2022 data? + + answer: 'No, the company is managing its CAPEX and Fixed Assets pretty efficiently, + which is evident from below key metrics: + + CAPEX/Revenue Ratio: 5.1% + + Fixed assets/Total Assets: 20% + + Return on Assets= 12.4%' + justification: 'CAPEX/Revenue + + Fixed Assets/Total Assets + + ROA=Net Income/Total Assets' + page(s)-0based: 47 + page(s): 48,50,52 + + category: 6-OTHER-ADVANCED + correctness: |- + the answer opines that 3M is actually managing capital assets efficiently, and justifies such opinion + by certain calculated financial ratio metric value(s) showing at least one of the following: + - Fixed Assets is not large as proportion of Total Assets; + - Capital Expenditure (CapEx) is not high relative to Revenue; and/or + - Return on (Total) Assets (RoA or RoTA) is quite good + + evaluator-unreliable: true + + +financebench_id_01226: + sector: Industrials + + company: 3M + period: 2022 + doc-type: 10k + doc: 3M_2022_10K + + question-type: domain-relevant + question-reasoning: Logical reasoning (based on numerical reasoning) OR Numerical + reasoning OR Logical reasoning + domain-question-num: dg17 + question: What drove operating margin change as of FY2022 for 3M? If operating margin + is not a useful metric for a company like this, then please state that and explain + why. 
+ + answer: "Operating Margin for 3M in FY2022 has decreased by 1.7% primarily due to:\ + \ \n-Decrease in gross Margin\n-mostly one-off charges including Combat Arms Earplugs\ + \ litigation, impairment related to exiting PFAS manufacturing, costs related\ + \ to exiting Russia and divestiture-related restructuring\ncharges" + justification: '' + page(s)-0based: 26 + page(s): '27' + + category: 0-RETRIEVE + correctness: |- + the answer mentions at least 1 salient change among those discussed below: + + COST OF SALES: + Cost of sales, measured as a percent of sales, increased in 2022 when compared to the same period last year. + Increases were primarily due to 2022 special item costs for significant litigation from additional commitments + to address PFAS-related matters at 3M's Zwijndrecht, Belgium site, higher raw materials and logistics costs, + manufacturing productivity headwinds which were further magnified by the shutdown of certain operations in Belgium + and progress on restarting previously-idled operations, and investments in growth, productivity and sustainability. + On a percent of sales basis, these increases were partially offset by increases in selling prices. + + SELLING, GENERAL AND ADMINISTRATIVE EXPENSES: + SG&A, measured as a percent of sales, increased in 2022 when compared to the same period last year. + SG&A was impacted by increased special item costs for significant litigation primarily related to steps toward + resolving Combat Arms Earplugs litigation resulting in a 2022 second quarter pre-tax charge of approximately $1.2 billion, + certain impairment costs related to exiting PFAS manufacturing, costs related to exiting Russia, + divestiture-related restructuring charges, and continued investment in key growth initiatives. + These increases were partially offset by restructuring benefits and ongoing general 3M cost management. 
+ + RESEARCH, DEVELOPMENT AND RELATED EXPENSES: + R&D, measured as a percent of sales, decreased in 2022 when compared to the same period last year. + 3M continues to invest in a range of R&D activities from application development, product and manufacturing support, + product development and technology development aimed at disruptive innovations. + + GAIN ON BUSINESS DIVESTITURES: + In the third quarter of 2022, 3M recorded a pre-tax gain of $2.7 billion ($2.7 billion after tax) + related to the split-off and combination of its Food Safety business with Neogen Corporation. + + GOODWILL IMPAIRMENT EXPENSE: + As a result of 3M's commitment to exit per- and polyfluoroalkyl substance (PFAS) manufacturing, + 3M recorded a goodwill impairment charge related to the Advanced Materials reporting unit + (within the Transportation and Electronics business). + + +financebench_id_01865: # tricky: Total Sales Change contains zero Acquisitions but non-zero Divestitures + sector: Industrials + + company: 3M + period: 2022 + doc-type: 10k + doc: 3M_2022_10K + + question-type: novel-generated + question-reasoning: '' + domain-question-num: '' + question: If we exclude the impact of M&A, which segment has dragged down 3M's overall + growth in 2022? + + answer: The consumer segment shrunk by 0.9% organically. + justification: '' + page(s)-0based: 24 + page(s): '25' + + category: 1-COMPARE + correctness: >- + the answer identifies Consumer segment as negative contributor + + +financebench_id_00807: + sector: Industrials + + company: 3M + period: 2023 + doc-type: 10q + doc: 3M_2023Q2_10Q + + question-type: domain-relevant + question-reasoning: Logical reasoning (based on numerical reasoning) OR Logical + reasoning + domain-question-num: dg01 + question: Does 3M have a reasonably healthy liquidity profile based on its quick + ratio for Q2 of FY2023? If the quick ratio is not relevant to measure liquidity, + please state that and explain why. + + answer: No. 
The quick ratio for 3M was 0.96 by Jun'23 close, which needs a bit of + an improvement to touch the 1x mark + justification: 'Quick Ratio= (Total current assets-Total inventories)/Total current + liabilities + + (15,754-5,280)/10,936' + page(s)-0based: 4 + page(s): '5' + + category: 4-CALC-AND-JUDGE + correctness: >- + the answer contains a calculated Quick Ratio decimal value that is over 0.75 but less than 1.00, + or, alternatively, a calculated percentage value that is over 75% but less than 100% + + +financebench_id_00941: + sector: Industrials + + company: 3M + period: 2023 + doc-type: 10q + doc: 3M_2023Q2_10Q + + question-type: domain-relevant + question-reasoning: Information extraction + domain-question-num: dg04 + question: Which debt securities are registered to trade on a national securities + exchange under 3M's name as of Q2 of 2023? + + answer: 'Following debt securities registered under 3M''s name are listed to trade + on the New York Stock Exchange: + + -1.500% Notes due 2026 (Trading Symbol: MMM26) + + -1.750% Notes due 2030 (Trading Symbol: MMM30) + + -1.500% Notes due 2031 (Trading Symbol: MMM31)' + justification: '' + page(s)-0based: 0 + page(s): '1' + + category: 0-RETRIEVE + correctness: >- + the answer mentions notes/securities due 2026, 2030 and 2031 + + evaluator-unreliable: true + + +financebench_id_01858: + sector: Industrials + + company: 3M + period: 2023 + doc-type: 10q + doc: 3M_2023Q2_10Q + + question-type: novel-generated + question-reasoning: '' + domain-question-num: '' + question: Does 3M maintain a stable trend of dividend distribution? 
+ + answer: Yes, not only they distribute the dividends on a routine basis, 3M has also + been increasing the per share dividend for consecutive 65 years + justification: '' + page(s)-0based: 61 + page(s): '62' + + category: 0-RETRIEVE + correctness: >- + the answer affirms that dividends have been stable, and/or mentions "65 years", "65th year" or something similar + + evaluator-unreliable: true + + +financebench_id_02987: + sector: Communication Services + + company: Activision Blizzard + period: 2019 + doc-type: 10k + doc: ACTIVISIONBLIZZARD_2019_10K + + question-type: metrics-generated + question-reasoning: Numerical reasoning + domain-question-num: '' + question: 'What is the FY2019 fixed asset turnover ratio for Activision Blizzard? + Fixed asset turnover ratio is defined as: FY2019 revenue / (average PP&E between + FY2018 and FY2019). Round your answer to two decimal places. Base your judgments + on the information provided primarily in the statement of income and the statement + of financial position.' + + answer: '24.26' + justification: 'The metric in question was calculated using other simpler metrics. + The various simpler metrics (from the current and, if relevant, previous fiscal + year(s)) used were: + + + Metric 1: Total revenue. This metric was located in the 10K as a single line item + named: Total net revenues. + + + Metric 2: Ppne, net. This metric was located in the 10K as a single line item + named: Property and equipment, net.' 
+ page(s)-0based: 68 + page(s): 69,70 + + category: 3-CALC-COMPLEX + correctness: >- + the answer contains a calculated Fixed Asset Turnover Ratio decimal value that is in the range from 23.00 to 25.00 + (if the answer is a single number, assume that it is that calculated Fixed Asset Turnover Ratio decimal value) + + evaluator-unreliable: true + + +financebench_id_07966: + sector: Communication Services + + company: Activision Blizzard + period: 2019 + doc-type: 10k + doc: ACTIVISIONBLIZZARD_2019_10K + + question-type: metrics-generated + question-reasoning: Numerical reasoning + domain-question-num: '' + question: What is the FY2017 - FY2019 3 year average of capex as a % of revenue + for Activision Blizzard? Answer in units of percents and round to one decimal + place. Calculate (or extract) the answer from the statement of income and the + cash flow statement. + + answer: 1.9% + justification: 'The metric in question was calculated using other simpler metrics. + The various simpler metrics (from the current and, if relevant, previous fiscal + year(s)) used were: + + + Metric 1: Capital expenditures. This metric was located in the 10K as a single + line item named: Capital expenditures. + + + Metric 2: Total revenue. This metric was located in the 10K as a single line item + named: Total net revenues.' 
+ page(s)-0based: 69 + page(s): 70,73 + + category: 3-CALC-COMPLEX + correctness: >- + the answer contains a calculated percentage value that is in the range from 1.70% to 2.10%, + or, alternatively, a calculated decimal value that is in the range from 0.0170 to 0.0210 + (if the answer is a single number, assume that it is that calculated metric value) + + evaluator-unreliable: true + + +financebench_id_04735: + sector: Information Technology + + company: Adobe + period: 2015 + doc-type: 10k + doc: ADOBE_2015_10K + + question-type: metrics-generated + question-reasoning: Numerical reasoning + domain-question-num: '' + question: 'You are an investment banker and your only resource(s) to answer the + following question is (are): the statement of financial position and the cash + flow statement. Here''s the question: what is the FY2015 operating cash flow ratio + for Adobe? Operating cash flow ratio is defined as: cash from operations / total + current liabilities. Round your answer to two decimal places.' + + answer: '0.66' + justification: 'The metric in question was calculated using other simpler metrics. + The various simpler metrics (from the current and, if relevant, previous fiscal + year(s)) used were: + + + Metric 1: Cash from operations. This metric was located in the 10K as a single + line item named: Net cash provided by operating activities. + + + Metric 2: Total current liabilities. This metric was located in the 10K as a single + line item named: Total current liabilities.' 
+ page(s)-0based: 58 + page(s): 59,63 + + category: 3-CALC-COMPLEX + correctness: >- + the answer contains a calculated Operating Cash Flow Ratio decimal value that is in the range from 0.6000 to 0.7000, + or, alternatively, a calculated percentage value that is in the range from 60.00% to 70.00% + (if the answer is a single number, assume that it is that calculated Operating Cash Flow Ratio metric value) + + +financebench_id_07507: + sector: Information Technology + + company: Adobe + period: 2016 + doc-type: 10k + doc: ADOBE_2016_10K + + question-type: metrics-generated + question-reasoning: Numerical reasoning + domain-question-num: '' + question: What is Adobe's year-over-year change in unadjusted operating income from + FY2015 to FY2016 (in units of percents and round to one decimal place)? Give a + solution to the question by using the income statement. + + answer: 65.4% + justification: 'The metric unadjusted operating income was directly extracted from + the company 10K. The line item name, as seen in the 10K, was: Operating income. + The final step was to execute the desired percent change calculation on unadjusted + operating income.' + page(s)-0based: 61 + page(s): '62' + + category: 2-CALC-CHANGE + correctness: >- + the answer contains a calculated Operating Income change percentage value that is in the range from 60.0% to 70.0% + (if the answer is a single number, assume that it is that calculated Operating Income change percentage value) + + +financebench_id_03856: + sector: Information Technology + + company: Adobe + period: 2017 + doc-type: 10k + doc: ADOBE_2017_10K + + question-type: metrics-generated + question-reasoning: Numerical reasoning + domain-question-num: '' + question: 'What is the FY2017 operating cash flow ratio for Adobe? Operating cash + flow ratio is defined as: cash from operations / total current liabilities. Round + your answer to two decimal places. 
Please utilize information provided primarily + within the balance sheet and the cash flow statement.' + + answer: '0.83' + justification: 'The metric in question was calculated using other simpler metrics. + The various simpler metrics (from the current and, if relevant, previous fiscal + year(s)) used were: + + + Metric 1: Cash from operations. This metric was located in the 10K as a single + line item named: Net cash provided by operating activities. + + + Metric 2: Total current liabilities. This metric was located in the 10K as a single + line item named: Total current liabilities.' + page(s)-0based: 56 + page(s): 57,61 + + category: 3-CALC-COMPLEX + correctness: >- + the answer contains a calculated Operating Cash Flow Ratio decimal value that is in the range from 0.8000 to 0.8500, + or, alternatively, a calculated percentage value that is in the range from 80.00% to 85.00% + (if the answer is a single number, assume that it is that calculated Operating Cash Flow Ratio metric value) + + +financebench_id_00438: + sector: Information Technology + + company: Adobe + period: 2022 + doc-type: 10k + doc: ADOBE_2022_10K + + question-type: domain-relevant + question-reasoning: Numerical reasoning OR information extraction + domain-question-num: dg14 + question: Does Adobe have an improving operating margin profile as of FY2022? If + operating margin is not a useful metric for a company like this, then state that + and explain why. + + answer: No the operating margins of Adobe have recently declined from 36.8% in FY + 2021 to 34.6% in FY2022. A drop by 2.2% in a year. 
+ justification: '6098/16388 + + 5802/14573' + page(s)-0based: 53 + page(s): '54' + + category: 4-CALC-AND-JUDGE + correctness: >- + the answer contains calculated Operating Margin percentage or decimal values for 2021 and 2022, + and concludes that such metric decreased + + evaluator-unreliable: true + + +financebench_id_00591: + sector: Information Technology + + company: Adobe + period: 2022 + doc-type: 10k + doc: ADOBE_2022_10K + + question-type: novel-generated + question-reasoning: '' + domain-question-num: '' + question: Does Adobe have an improving Free cashflow conversion as of FY2022? + + answer: Yes, the FCF conversion (using net income as the denominator) for Adobe + has improved by ~13% from 143% in 2021 to 156% in 2022 + justification: 'FCF Conversion: (Net cash provided by operating activities - Purchases + of property and equipment)/Net income + + (7838-442)/4756 + + (7230-348)/4822' + page(s)-0based: 56 + page(s): '57' + + category: 4-CALC-AND-JUDGE + correctness: >- + the answer contains calculated Free Cash Flow Conversion Ratio percentage or decimal values for 2021 and 2022, + and concludes that such metric increased + + evaluator-unreliable: true + + +financebench_id_01319: + sector: Utilities + + company: AES Corporation + period: 2022 + doc-type: 10k + doc: AES_2022_10K + + question-type: domain-relevant + question-reasoning: Information extraction + domain-question-num: dg21 + question: What is the quantity of restructuring costs directly outlined in AES Corporation's + income statements for FY2022? If restructuring costs are not explicitly outlined + then state 0. 
+ + answer: '0' + justification: '' + page(s)-0based: 131 + page(s): '132' + + category: 0-RETRIEVE + correctness: >- + the answer states 0, zero, and/or that restructuring costs are not explicitly mentioned/reported + + evaluator-unreliable: true + + +financebench_id_00540: + sector: Utilities + + company: AES Corporation + period: 2022 + doc-type: 10k + doc: AES_2022_10K + + question-type: domain-relevant + question-reasoning: Numerical reasoning OR Logical reasoning + domain-question-num: dg25 + question: Roughly how many times has AES Corporation sold its inventory in FY2022? + Calculate inventory turnover ratio for the FY2022; if conventional inventory management + is not meaningful for the company then state that and explain why. + + answer: AES has converted inventory 9.5 times in FY 2022. + justification: 'Cost of sales/Inventory + + 10069/1055' + page(s)-0based: 129 + page(s): 130,132 + + category: 3-CALC-COMPLEX + correctness: |- + the answer contains a calculated Inventory Turnover Ratio (or Inventory Conversion Ratio) decimal value that is either: + - in the range from 9.0 to 10.0 times (implicitly using ending Inventory as denominator), or + - approximately 12.0 times (implicitly using average Inventory as denominator) + (if the answer is a single number, assume that it is that calculated Inventory Turnover Ratio decimal value) + + +financebench_id_10420: + sector: Utilities + + company: AES Corporation + period: 2022 + doc-type: 10k + doc: AES_2022_10K + + question-type: metrics-generated + question-reasoning: Numerical reasoning + domain-question-num: '' + question: 'Based on the information provided primarily in the statement of financial + position and the statement of income, what is AES''s FY2022 return on assets (ROA)? + ROA is defined as: FY2022 net income / (average total assets between FY2021 and + FY2022). Round your answer to two decimal places.' 
+ answer: '-0.02' + justification: 'The metric in question was calculated using other simpler metrics. + The various simpler metrics (from the current and, if relevant, previous fiscal + year(s)) used were: + + + Metric 1: Net income. This metric was located in the 10K as a single line item + named: NET INCOME (LOSS) ATTRIBUTABLE TO THE AES CORPORATION. + + + Metric 2: Total assets. This metric was located in the 10K as a single line item + named: TOTAL ASSETS.' + page(s)-0based: 129 + page(s): 130,132 + + category: 3-CALC-COMPLEX + correctness: >- + the answer contains a calculated Return on Assets (RoA) + percentage value that is NEGATIVE and in the range from -2.00% to -1.40%, + or, alternatively, a calculated decimal value that is NEGATIVE and in the range from -0.0200 to -0.0140 + (if the answer is a single number, assume that it is that calculated Return on Assets (RoA) metric value) + + evaluator-unreliable: true + + +financebench_id_06655: + sector: Consumer Discretionary + + company: Amazon + period: 2017 + doc-type: 10k + doc: AMAZON_2017_10K + + question-type: metrics-generated + question-reasoning: Numerical reasoning + domain-question-num: '' + question: 'What is Amazon''s FY2017 days payable outstanding (DPO)? DPO is defined + as: 365 * (average accounts payable between FY2016 and FY2017) / (FY2017 COGS + + change in inventory between FY2016 and FY2017). Round your answer to two decimal + places. Address the question by using the line items and information shown within + the balance sheet and the P&L statement.' + + answer: '93.86' + justification: 'The metric in question was calculated using other simpler metrics. + The various simpler metrics (from the current and, if relevant, previous fiscal + year(s)) used were: + + + Metric 1: Accounts payable. This metric was located in the 10K as a single line + item named: Accounts payable. + + + Metric 2: Inventories. This metric was located in the 10K as a single line item + named: Inventories. 
+ + + Metric 3: Cost of goods sold. This metric was located in the 10K as a single line + item named: Cost of sales.' + page(s)-0based: 37 + page(s): 38,40 + + category: 3-CALC-COMPLEX + correctness: >- + the answer contains a calculated Days Payable Outstanding (DPO) decimal value that is in the range from 90.00 to 100.00 + (if the answer is a single number, assume that it is that calculated Days Payable Outstanding (DPO) metric value) + + +financebench_id_08135: + sector: Consumer Discretionary + + company: Amazon + period: 2017 + doc-type: 10k + doc: AMAZON_2017_10K + + question-type: metrics-generated + question-reasoning: Numerical reasoning + domain-question-num: '' + question: What is Amazon's year-over-year change in revenue from FY2016 to FY2017 + (in units of percents and round to one decimal place)? Calculate what was asked + by utilizing the line items clearly shown in the statement of income. + + answer: 30.8% + justification: 'The metric total revenue was directly extracted from the company + 10K. The line item name, as seen in the 10K, was: Total net sales. The final step + was to execute the desired percent change calculation on total revenue.' + page(s)-0based: 37 + page(s): '38' + + category: 2-CALC-CHANGE + correctness: >- + the answer contains a calculated Revenue change percentage value that is in the range from 30.0% to 31.0% + (if the answer is a single number, assume that it is that calculated Revenue change percentage value) + + +financebench_id_08286: + sector: Consumer Discretionary + + company: Amazon + period: 2019 + doc-type: 10k + doc: AMAZON_2019_10K + + question-type: metrics-generated + question-reasoning: Information extraction + domain-question-num: '' + question: By drawing conclusions from the information stated only in the income + statement, what is Amazon's FY2019 net income attributable to shareholders (in + USD millions)? 
+ + answer: $11588.00 + justification: 'The metric net income was directly extracted from the company 10K. + The line item name, as seen in the 10K, was: Net income.' + page(s)-0based: 37 + page(s): '38' + + category: 0-RETRIEVE + correctness: >- + the answer contains a quantity equivalent to or approximately equal to + 11588, 11588 million, 11.588 billion, + 11600, 11600 million or 11.6 billion + + +financebench_id_03882: + sector: Materials + + company: Amcor + period: 2020 + doc-type: 10k + doc: AMCOR_2020_10K + + question-type: metrics-generated + question-reasoning: Information extraction + domain-question-num: '' + question: What is Amcor's year end FY2020 net AR (in USD millions)? Address the + question by adopting the perspective of a financial analyst who can only use the + details shown within the balance sheet. + + answer: $1616.00 + justification: 'The metric accounts receivable, net was directly extracted from + the company 10K. The line item name, as seen in the 10K, was: Trade receivables, + net.' + page(s)-0based: 49 + page(s): '50' + + category: 0-RETRIEVE + correctness: >- + the answer contains a quantity equivalent to or approximately equal to + 1615.9, 1615.9 million, + 1616, 1616 million, 1.616 billion, + 1600, 1600 million or 1.6 billion + + evaluator-unreliable: true + + +financebench_id_01935: + sector: Materials + + company: Amcor + period: 2022 + doc-type: 8k + doc: AMCOR_2022_8K_dated-2022-07-01 + + question-type: novel-generated + question-reasoning: '' + domain-question-num: '' + question: What was the key agenda of the AMCOR's 8k filing dated 1st July 2022? + + answer: Amcor Finance (USA), Inc. and Amcor Flexibles North America, Inc., entered + into supplemental indentures relating to Guaranteed Senior Notes due 2026 and + 2028. This involved the substitution of the Substitute Issuer (Amcor Flexibles + North America) for the Former Issuer (Amcor Finance) and the assumption of covenants + under the indentures. 
(In essence a novation agreement) + justification: '' + page(s)-0based: 1 + page(s): '2' + + category: 0-RETRIEVE + correctness: >- + the answer mentions one of the terms "supplemental", "indenture(s)", "substitute" or "substitution" + + evaluator-unreliable: true + + +financebench_id_00799: + sector: Materials + + company: Amcor + period: 2023 + doc-type: 10k + doc: AMCOR_2023_10K + + question-type: domain-relevant + question-reasoning: Numerical reasoning OR Logical reasoning + domain-question-num: dg02 + question: Has AMCOR's quick ratio improved or declined between FY2023 and FY2022? + If the quick ratio is not something that a financial analyst would ask about a + company like this, then state that and explain why. + + answer: The quick ratio has slightly improved from 0.67 times to 0.69 times between + FY 2023 and FY 2022.(3.4% jump) + justification: 'Quick Ratio= (Total current assets-(Raw materials and supplies+Work + in process and finished goods))/Total current liabilities + + (5308-992-1221)/4476 + + (5853-1114-1325)/5103' + page(s)-0based: 51 + page(s): '52' + + category: 4-CALC-AND-JUDGE + correctness: >- + the answer contains calculated Quick Ratio decimal or percentage values for 2022 and 2023, + both over 0.50 but less than 0.75 (if decimal), or, alternatively, over 50% but less than 75% (if percentage); + the answer then concludes that such metric increased + + +financebench_id_01079: + sector: Materials + + company: Amcor + period: 2023 + doc-type: 10k + doc: AMCOR_2023_10K + + question-type: domain-relevant + question-reasoning: Information extraction + domain-question-num: dg10 + question: What are major acquisitions that AMCOR has done in FY2023, FY2022 and + FY2021? + + answer: 'Amcor completed these acquisitions during FY2023: + + -100% equity interest of a flexibles manufacturing company in the Czech Republic + + - 100% equity interest in a medical device packaging manufacturing site in + + Shanghai, China.
+ + -acquisition of a New Zealand-based leading manufacturer of state-of-the-art, + automated protein + + packaging machines.' + justification: '' + page(s)-0based: 63 + page(s): '64' + + category: 0-RETRIEVE + correctness: |- + the answer mentions acquisitions in at least 2 of the following: + - Czech Republic; + - New Zealand; and + - Shanghai, China (or, alternatively, just "Shanghai" or just "China") + + +financebench_id_01148: + sector: Materials + + company: Amcor + period: 2023 + doc-type: 10k + doc: AMCOR_2023_10K + + question-type: domain-relevant + question-reasoning: Information extraction OR Logical reasoning OR + domain-question-num: dg12 + question: What industry does AMCOR primarily operate in? + + answer: Amcor is a global leader in packaging production for various use cases. + justification: '' + page(s)-0based: 4 + page(s): '5' + + category: 0-RETRIEVE + correctness: >- + the answer mentions "packaging" + + +financebench_id_00684: + sector: Materials + + company: Amcor + period: 2023 + doc-type: 10k + doc: AMCOR_2023_10K + + question-type: domain-relevant + question-reasoning: Numerical reasoning OR information extraction + domain-question-num: dg13 + question: Does AMCOR have an improving gross margin profile as of FY2023? If gross + margin is not a useful metric for a company like this, then state that and explain + why. + + answer: No. For AMCOR there has been a slight decline in gross margins by 0.8%. 
+ justification: 'Gross Profit/Net Sales + + 2725/14694 + + 2820/14544' + page(s)-0based: 49 + page(s): '50' + + category: 4-CALC-AND-JUDGE + correctness: >- + the answer contains calculated Gross Margin percentage or decimal values for 2022 and 2023, + and concludes that such metric decreased + answer-inadequate: true + + +financebench_id_01936: + sector: Materials + + company: Amcor + period: 2023 + doc-type: 10q + doc: AMCOR_2023Q2_10Q + + question-type: novel-generated + question-reasoning: '' + domain-question-num: '' + question: What is the nature & purpose of AMCOR's restructuring liability as oF + Q2 of FY2023 close? + + answer: 87% of the total restructuring liability is related Employee liabilities. + justification: '' + page(s)-0based: 14 + page(s): '15' + + category: 0-RETRIEVE + correctness: |- + the answer mentions Employee costs or Employee liabilities + + +financebench_id_01928: + sector: Materials + + company: Amcor + period: 2023 + doc-type: Earnings + doc: AMCOR_2023Q4_EARNINGS + + question-type: novel-generated + question-reasoning: '' + domain-question-num: '' + question: What Was AMCOR's Adjusted Non GAAP EBITDA for FY 2023 + + answer: AMCOR's Adj. EBITDA was $2,018mn in FY 2023 + justification: '' + page(s)-0based: 11 + page(s): '12' + + category: 0-RETRIEVE + correctness: >- + the answer contains a quantity equivalent to or approximately equal to + 2018 million, 2.018 billion, + 2000 million or 2 billion + + evaluator-unreliable: true + + +financebench_id_01930: + sector: Materials + + company: Amcor + period: 2023 + doc-type: Earnings + doc: AMCOR_2023Q4_EARNINGS + + question-type: novel-generated + question-reasoning: '' + domain-question-num: '' + question: How much was the Real change in Sales for AMCOR in FY 2023 vs FY 2022, + if we exclude the impact of FX movement, passthrough costs and one-off items? + + answer: The Real Growth was flat in FY 2023 vs FY 2022. 
+ justification: '' + page(s)-0based: 9 + page(s): '10' + + category: 2-CALC-CHANGE + correctness: >- + the answer concludes that the percentage change was approximately 1%, + or, alternatively, concludes that the growth was flat / small + + evaluator-unreliable: true + + +financebench_id_03069: + sector: Information Technology + + company: AMD + period: 2015 + doc-type: 10k + doc: AMD_2015_10K + + question-type: metrics-generated + question-reasoning: Numerical reasoning + domain-question-num: '' + question: Answer the following question as if you are an equity research analyst + and have lost internet connection so you do not have access to financial metric + providers. According to the details clearly outlined within the P&L statement + and the statement of cash flows, what is the FY2015 depreciation and amortization + (D&A from cash flow statement) % margin for AMD? + + answer: 4.2% + justification: 'The metric in question was calculated using other simpler metrics. + The various simpler metrics (from the current and, if relevant, previous fiscal + year(s)) used were: + + + Metric 1: Depreciation and amortization. This metric was located in the 10K as + a single line item named: Depreciation and amortization. + + + Metric 2: Total revenue. This metric was located in the 10K as a single line item + named: Net revenue.' 
+ page(s)-0based: 55 + page(s): 56,60 + + category: 3-CALC-COMPLEX + correctness: >- + the answer contains a calculated Depreciation & Amortization (D&A) Margin (using Net Revenue as denominator) + percentage value that is in the range from 4.00% to 4.50%, + or, alternatively, a calculated decimal value that is in the range from 0.0400 to 0.0450 + (if the answer is a single number, assume that it is that calculated Depreciation & Amortization (D&A) Margin metric value) + + +financebench_id_00222: + sector: Information Technology + + company: AMD + period: 2022 + doc-type: 10k + doc: AMD_2022_10K + + question-type: domain-relevant + question-reasoning: Logical reasoning (based on numerical reasoning) OR Logical + reasoning + domain-question-num: dg01 + question: Does AMD have a reasonably healthy liquidity profile based on its quick + ratio for FY22? If the quick ratio is not relevant to measure liquidity, please + state that and explain why. + + answer: Yes. The quick ratio is 1.57, calculated as (cash and cash equivalents+Short + term investments+Accounts receivable, net+receivables from related parties)/ (current + liabilities). + justification: '' + page(s)-0based: 55 + page(s): '56' + + category: 4-CALC-AND-JUDGE + correctness: >- + the answer contains a calculated Quick Ratio decimal value that is in the range from 1.40 to 1.90, + or, alternatively, a calculated percentage value that is in the range from 140% to 190% + + +financebench_id_00995: + sector: Information Technology + + company: AMD + period: 2022 + doc-type: 10k + doc: AMD_2022_10K + + question-type: domain-relevant + question-reasoning: Information extraction + domain-question-num: dg07 + question: What are the major products and services that AMD sells as of FY22? 
+ + answer: AMD sells server microprocessors (CPUs) and graphics processing units (GPUs), + data processing units (DPUs), Field Programmable Gate Arrays (FPGAs), and Adaptive + System-on-Chip (SoC) products for data centers; CPUs, accelerated processing units + (APUs) that integrate CPUs and GPUs, and chipsets for desktop and notebook personal + computers; discrete GPUs, and semi-custom SoC products and development services; + and embedded CPUs, GPUs, APUs, FPGAs, and Adaptive SoC products. + justification: '' + page(s)-0based: 3 + page(s): '4' + + category: 0-RETRIEVE + correctness: >- + the answer mentions at least graphics (i.e., GPU) and FPGA products + + evaluator-unreliable: true + + +financebench_id_01198: + sector: Information Technology + + company: AMD + period: 2022 + doc-type: 10k + doc: AMD_2022_10K + + question-type: domain-relevant + question-reasoning: Information extraction + domain-question-num: dg15 + question: What drove revenue change as of the FY22 for AMD? + + answer: In 2022, AMD reported Higher sales of their EPYC server processors, higher + semi-custom product sales, and the inclusion of Xilinx embedded product sales + justification: '' + page(s)-0based: 42 + page(s): '43' + + category: 0-RETRIEVE + correctness: |- + the answer mentions at least 2 of the following: + - "Data Center" and/or "EPYC"; + - "Gaming" and/or "semi-custom"; and + - "Embedded" and/or "Xilinx" + + evaluator-unreliable: true + + +financebench_id_00917: + sector: Information Technology + + company: AMD + period: 2022 + doc-type: 10k + doc: AMD_2022_10K + + question-type: domain-relevant + question-reasoning: Logical reasoning (based on numerical reasoning) OR Numerical + reasoning OR Logical reasoning + domain-question-num: dg17 + question: What drove operating margin change as of the FY22 for AMD? If operating + margin is not a useful metric for a company like this, then please state that + and explain why. 
+ + answer: The decrease in AMD's operating income was primarily driven by amortization + of intangible assets associated with the Xilinx acquisition + justification: '' + page(s)-0based: 42 + page(s): '43' + + category: 0-RETRIEVE + correctness: >- + the answer mentions Xilinx + + +financebench_id_01279: + sector: Information Technology + + company: AMD + period: 2022 + doc-type: 10k + doc: AMD_2022_10K + + question-type: domain-relevant + question-reasoning: Numerical reasoning + domain-question-num: dg19 + question: Among operations, investing, and financing activities, which brought in + the most (or lost the least) cash flow for AMD in FY22? + + answer: In 2022, AMD brought in the most cashflow from Operations + justification: '' + page(s)-0based: 57 + page(s): '58' + + category: 1-COMPARE + correctness: >- + the answer identifies Operations / Operating Cash Flows as bringing in most cash + + +financebench_id_00563: + sector: Information Technology + + company: AMD + period: 2022 + doc-type: 10k + doc: AMD_2022_10K + + question-type: novel-generated + question-reasoning: '' + domain-question-num: '' + question: From FY21 to FY22, excluding Embedded, in which AMD reporting segment + did sales proportionally increase the most? + + answer: Data Center + justification: "Data center: \nFY22: 6,043\nFY21: 3,694 \n6,043/3,694-1 = 63,59%\n\ + \nClient: \nFY22: 6,201\nFY21: 6,887 \n6,201/6,887-1 = -9,96%\n\n\nGaming: \n\ + FY22: 6,805\nFY21: 5,607 \n6,805/5,607-1 = 21,37%" + page(s)-0based: 47 + page(s): '48' + + category: 1-COMPARE + correctness: >- + the answer identifies Data Center segment as proportionally growing most strongly + + +financebench_id_00757: + sector: Information Technology + + company: AMD + period: 2022 + doc-type: 10k + doc: AMD_2022_10K + + question-type: novel-generated + question-reasoning: '' + domain-question-num: '' + question: Did AMD report customer concentration in FY22? 
+ + answer: Yes, one customer accounted for 16% of consolidated net revenue + justification: One customer accounting for 16% of net revenue is a high customer concentration + page(s)-0based: 11 + page(s): '12' + + category: 0-RETRIEVE + correctness: >- + the answer mentions that one or a small number of customers + accounted for large portion of revenue + + evaluator-unreliable: true + + +financebench_id_00476: + sector: Financials + + company: American Express + period: 2022 + doc-type: 10k + doc: AMERICANEXPRESS_2022_10K + + question-type: domain-relevant + question-reasoning: Information extraction + domain-question-num: dg04 + question: Which debt securities are registered to trade on a national securities + exchange under American Express' name as of 2022? + + answer: There are none + justification: No debt securities are listed under the securities registered pursuant + to Section 12(b) of the Act, which implies there are none + page(s)-0based: 0 + page(s): '1' + + category: 0-RETRIEVE + correctness: >- + the answer concludes that there are no debt securities traded, + or, alternatively, that no such debt securities are explicitly reported + + evaluator-unreliable: true + + +financebench_id_01028: + sector: Financials + + company: American Express + period: 2022 + doc-type: 10k + doc: AMERICANEXPRESS_2022_10K + + question-type: domain-relevant + question-reasoning: Information extraction + domain-question-num: dg08 + question: What are the geographies that American Express primarily operates in as + of 2022?
+ + answer: United States, EMEA, APAC, and LACC + justification: '' + page(s)-0based: 154 + page(s): '155' + + category: 0-RETRIEVE + correctness: |- + the answer mentions at least 3 among: + - United States (US); + - Europe, the Middle East and Africa (EMEA); + - Asia Pacific, Australia and New Zealand (APAC); and + - Latin America, Canada and the Caribbean (LACC) + + +financebench_id_00723: + sector: Financials + + company: American Express + period: 2022 + doc-type: 10k + doc: AMERICANEXPRESS_2022_10K + + question-type: domain-relevant + question-reasoning: Numerical reasoning OR information extraction + domain-question-num: dg14 + question: Does AMEX have an improving operating margin profile as of 2022? If operating + margin is not a useful metric for a company like this, then state that and explain + why. + + answer: Performance is not measured through operating margin + justification: It's a financial services company and performance is measured through + the Net Interest Margin. + page(s)-0based: 95 + page(s): '96' + + category: 6-OTHER-ADVANCED + correctness: >- + the answer argues that Operating Margin is not a very relevant/useful metric for this business model and/or industry, + or, alternatively, that performance in this business model and/or industry is usually not judged through Operating Margin + + evaluator-unreliable: true + + +financebench_id_00720: + sector: Financials + + company: American Express + period: 2022 + doc-type: 10k + doc: AMERICANEXPRESS_2022_10K + + question-type: domain-relevant + question-reasoning: Logical reasoning (based on numerical reasoning) OR Numerical + reasoning OR Logical reasoning + domain-question-num: dg16 + question: What drove gross margin change as of the FY2022 for American Express? + If gross margin is not a useful metric for a company like this, then please state + that and explain why. 
+ + answer: Performance is not measured through gross margin + justification: It's a financial services company and performance is measured through + the Net Interest Margin. + page(s)-0based: 95 + page(s): '96' + + category: 6-OTHER-ADVANCED + correctness: >- + the answer argues that Gross Margin is not a very relevant/useful metric for this business model and/or industry, + or, alternatively, that performance in this business model and/or industry is usually not judged through Gross Margin + + evaluator-unreliable: true + + +financebench_id_01351: + sector: Financials + + company: American Express + period: 2022 + doc-type: 10k + doc: AMERICANEXPRESS_2022_10K + + question-type: domain-relevant + question-reasoning: Numerical reasoning + domain-question-num: dg23 + question: How much has the effective tax rate of American Express changed between + FY2021 and FY2022? + + answer: The effective tax rate for American Express has changed/dropped from 24.6% + in FY 2021 to 21.6% in FY 2022. + justification: '' + page(s)-0based: 43 + page(s): '44' + + category: 2-CALC-CHANGE + correctness: >- + the answer says Effective Tax Rate changed from 24.6% to 21.6%, + and/or that it decreased by 3 percentage points or 3% + + evaluator-unreliable: true + + +financebench_id_01964: + sector: Financials + + company: American Express + period: 2022 + doc-type: 10k + doc: AMERICANEXPRESS_2022_10K + + question-type: novel-generated + question-reasoning: '' + domain-question-num: '' + question: What was the largest liability in American Express's Balance Sheet in + 2022?
+ + answer: Customer deposits + justification: '' + page(s)-0based: 97 + page(s): '98' + + category: 1-COMPARE + correctness: >- + the answer identifies Customer Deposits as largest liability + + evaluator-unreliable: true + + +financebench_id_01981: + sector: Financials + + company: American Express + period: 2022 + doc-type: 10k + doc: AMERICANEXPRESS_2022_10K + + question-type: novel-generated + question-reasoning: '' + domain-question-num: '' + question: Was American Express able to retain card members during 2022? + + answer: 'Yes' + justification: '' + page(s)-0based: 44 + page(s): '45' + + category: 0-RETRIEVE + correctness: >- + the answer affirms that retention was good/high + + evaluator-unreliable: true + + +financebench_id_05718: + sector: Utilities + + company: American Water Works + period: 2020 + doc-type: 10k + doc: AMERICANWATERWORKS_2020_10K + + question-type: metrics-generated + question-reasoning: Information extraction + domain-question-num: '' + question: How much (in USD billions) did American Water Works pay out in cash dividends + for FY2020? Compute or extract the answer by primarily using the details outlined + in the statement of cash flows. + + answer: $0.40 + justification: 'The metric total cash dividends paid out was directly extracted + from the company 10K. The line item name, as seen in the 10K, was: Dividends paid.' 
+ page(s)-0based: 85 + page(s): '86' + + category: 0-RETRIEVE + correctness: >- + the answer contains a quantity equivalent to or approximately equal to + 0.389, 0.389 billion, 389 million, + 0.4, 0.4 billion or 400 million + + +financebench_id_04254: + sector: Utilities + + company: American Water Works + period: 2021 + doc-type: 10k + doc: AMERICANWATERWORKS_2021_10K + + question-type: metrics-generated + question-reasoning: Numerical reasoning + domain-question-num: '' + question: Basing your judgments off of the cash flow statement and the income statement, + what is American Water Works's FY2021 unadjusted operating income + depreciation + and amortization from the cash flow statement (unadjusted EBITDA) in USD millions? + + answer: $1832.00 + justification: 'The metric in question was calculated using other simpler metrics. + The various simpler metrics (from the current and, if relevant, previous fiscal + year(s)) used were: + + + Metric 1: Depreciation and amortization. This metric was located in the 10K as + a single line item named: Depreciation and amortization. + + + Metric 2: Unadjusted operating income. This metric was located in the 10K as a + single line item named: Operating income.' + page(s)-0based: 85 + page(s): 86,88 + + category: 3-CALC-COMPLEX + correctness: >- + the answer contains a quantity equivalent to or approximately equal to + 1832, 1832 million, 1.832 billion, + 1800, 1800 million or 1.8 billion + + +financebench_id_00070: + sector: Utilities + + company: American Water Works + period: 2022 + doc-type: 10k + doc: AMERICANWATERWORKS_2022_10K + + question-type: domain-relevant + question-reasoning: Numerical reasoning OR Logical reasoning + domain-question-num: dg24 + question: Does American Water Works have positive working capital based on FY2022 + data? If working capital is not a useful or relevant metric for this company, + then please state that and explain why. 
+ + answer: No, American Water Works had negative working capital of -$1561M in FY 2022. + justification: 'Accounts receivable+Income tax receivable+Unbilled revenues+Materials + and supplies+other-Accounts payable-Accrued liabilities-Accrued taxes + + 334+114+275+98+312-254-706-49' + page(s)-0based: 80 + page(s): 81,82 + + category: 3-CALC-COMPLEX + correctness: >- + the answer contains a calculated (Net) Working Capital metric value in dollars + that is NEGATIVE and equivalent to or approximately equal to + minus/negative 1561, minus/negative 1561 million, minus/negative 1.561 billion, + minus/negative 1600, minus/negative 1600 million or minus/negative 1.6 billion + + evaluator-unreliable: true + + +financebench_id_02608: + sector: Consumer Discretionary + + company: Best Buy + period: 2017 + doc-type: 10k + doc: BESTBUY_2017_10K + + question-type: metrics-generated + question-reasoning: Numerical reasoning + domain-question-num: '' + question: In agreement with the information outlined in the income statement, what + is the FY2015 - FY2017 3 year average net profit margin (as a %) for Best Buy? + Answer in units of percents and round to one decimal place. + + answer: 2.8% + justification: 'The metric in question was calculated using other simpler metrics. + The various simpler metrics (from the current and, if relevant, previous fiscal + year(s)) used were: + + + Metric 1: Total revenue. This metric was located in the 10K as a single line item + named: Revenue. + + + Metric 2: Net income. This metric was located in the 10K as a single line item + named: Net earnings attributable to Best Buy Co., Inc. shareholders.' 
+ page(s)-0based: 55 + page(s): '56' + + category: 3-CALC-COMPLEX + correctness: >- + the answer contains a calculated Average Net Profit Margin percentage value that is in the range from 2.50% to 3.00%, + or, alternatively, a calculated decimal value that is in the range from 0.0250 to 0.0300 + (if the answer is a single number, assume that it is that calculated Average Net Profit Margin metric value) + + +financebench_id_04417: + sector: Consumer Discretionary + + company: Best Buy + period: 2019 + doc-type: 10k + doc: BESTBUY_2019_10K + + question-type: metrics-generated + question-reasoning: Information extraction + domain-question-num: '' + question: What is the year end FY2019 total amount of inventories for Best Buy? + Answer in USD millions. Base your judgments on the information provided primarily + in the balance sheet. + + answer: $5409.00 + justification: 'The metric inventories was directly extracted from the company 10K. + The line item name, as seen in the 10K, was: Merchandise inventories.' + page(s)-0based: 51 + page(s): '52' + + category: 0-RETRIEVE + correctness: >- + the answer contains a quantity equivalent to or approximately equal to + 5409, 5409 million, 5.409 billion, + 5400, 5400 million or 5.4 billion + + +financebench_id_00685: + sector: Consumer Discretionary + + company: Best Buy + period: 2023 + doc-type: 10k + doc: BESTBUY_2023_10K + + question-type: domain-relevant + question-reasoning: Logical reasoning (based on numerical reasoning) OR Logical + reasoning + domain-question-num: dg03 + question: Are Best Buy's gross margins historically consistent (not fluctuating + more than roughly 2% each year)? If gross margins are not a relevant metric for + a company like this, then please state that and explain why. + + answer: Yes, the margins have been consistent, there has been a minor decline of + 1.1% in gross margins between FY2022 and FY2023. 
+ justification: 'Gross Profit/Revenue + + 9912/46298 + + 11640/51761' + page(s)-0based: 39 + page(s): '40' + + category: 4-CALC-AND-JUDGE + correctness: >- + the answer contains calculated Gross Margin + percentage values for 2022 and 2023 that are within 2 percentage points (or 2%) of each other, + or, alternatively, calculated decimal values that are within 0.02 of each other + answer-inadequate: true + + +financebench_id_01077: + sector: Consumer Discretionary + + company: Best Buy + period: 2023 + doc-type: 10k + doc: BESTBUY_2023_10K + + question-type: domain-relevant + question-reasoning: Information extraction + domain-question-num: dg10 + question: What are major acquisitions that Best Buy has done in FY2023, FY2022 and + FY2021? + + answer: 'Best Buy closed two acquisitions, both these companies were already partially + owned by Best Buy, but Best Buy acquired all outstanding shares of these two companies + during FY 2022: (1) Current Health Ltd and (2) Two Peaks, LLC d/b/a Yardbird Furniture' + justification: '' + page(s)-0based: 50 + page(s): '51' + + category: 0-RETRIEVE + correctness: >- + the answer mentions Current Health and Two Peaks (which is also alternatively called Yardbird) + + +financebench_id_01275: + sector: Consumer Discretionary + + company: Best Buy + period: 2023 + doc-type: 10k + doc: BESTBUY_2023_10K + + question-type: domain-relevant + question-reasoning: Numerical reasoning + domain-question-num: dg19 + question: Among operations, investing, and financing activities, which brought in + the most (or lost the least) cash flow for Best Buy in FY2023? 
+ + answer: Best Buy generated the most cash flow from operating activities in FY 2023 + ($1.8 bn) + justification: '' + page(s)-0based: 41 + page(s): '42' + + category: 1-COMPARE + correctness: >- + the answer identifies Operations / Operating Cash Flows as bringing in most cash + + +financebench_id_00288: + sector: Consumer Discretionary + + company: Best Buy + period: 2024 + doc-type: 10q + doc: BESTBUY_2024Q2_10Q + + question-type: novel-generated + question-reasoning: '' + domain-question-num: '' + question: Was there any drop in Cash & Cash equivalents between FY 2023 and Q2 of + FY2024? + + answer: Yes, there was a decline of ~42% between FY2023 and Q2 of FY 2024. + justification: 1093/1874-1 + page(s)-0based: 19 + page(s): '20' + + category: 1-COMPARE + correctness: >- + the answer affirms that Cash & Cash Equivalents decreased + + +financebench_id_00460: + sector: Consumer Discretionary + + company: Best Buy + period: 2024 + doc-type: 10q + doc: BESTBUY_2024Q2_10Q + + question-type: novel-generated + question-reasoning: '' + domain-question-num: '' + question: Was there any change in the number of Best Buy stores between Q2 of FY2024 + and FY2023? + + answer: Yes, there is decline in number stores by 1.32% from 982 stores in Q2 FY + 2023 to 969 by the end of Q2 FY2024. + justification: 969/982-1 + page(s)-0based: 16 + page(s): '17' + + category: 1-COMPARE + correctness: >- + the answer mentions that number of stores decreased + + evaluator-unreliable: true + + +financebench_id_01902: + sector: Consumer Discretionary + + company: Best Buy + period: 2024 + doc-type: 10q + doc: BESTBUY_2024Q2_10Q + + question-type: novel-generated + question-reasoning: '' + domain-question-num: '' + question: Which Best Buy product category performed the best (by top line) in the + domestic (USA) Market during Q2 of FY2024? + + answer: The entertainment segment experienced the highest growth of 9% during Q2 + FY2024, primarily from gaming division.
+ justification: '' + page(s)-0based: 17 + page(s): '18' + + category: 1-COMPARE + correctness: |- + the answer either: + - identifies Entertainment (or Gaming) category/segment as proportionally growing most; or + - identifies Computing and Mobile Phones category/segment as having highest revenue + + evaluator-unreliable: true + + +financebench_id_04660: + sector: Information Technology + + company: Block + period: 2016 + doc-type: 10k + doc: BLOCK_2016_10K + + question-type: metrics-generated + question-reasoning: Numerical reasoning + domain-question-num: '' + question: Considering the data in the balance sheet, what is Block's (formerly known + as Square) FY2016 working capital ratio? Define working capital ratio as total + current assets divided by total current liabilities. Round your answer to two + decimal places. + + answer: '1.73' + justification: 'The metric in question was calculated using other simpler metrics. + The various simpler metrics (from the current and, if relevant, previous fiscal + year(s)) used were: + + + Metric 1: Total current liabilities. This metric was located in the 10K as a single + line item named: Total current liabilities. + + + Metric 2: Total current assets. This metric was located in the 10K as a single + line item named: Total current assets.' 
+ page(s)-0based: 67 + page(s): '68' + + category: 3-CALC-COMPLEX + correctness: >- + the answer contains a calculated Working Capital Ratio decimal value that is in the range from 1.70 to 1.80, + or, alternatively, a calculated percentage value that is in the range from 170% to 180% + (if the answer is a single number, assume that it is that calculated Working Capital Ratio metric value) + + +financebench_id_03838: + sector: Information Technology + + company: Block + period: 2020 + doc-type: 10k + doc: BLOCK_2020_10K + + question-type: metrics-generated + question-reasoning: Numerical reasoning + domain-question-num: '' + question: What is the FY2019 - FY2020 total revenue growth rate for Block (formerly + known as Square)? Answer in units of percents and round to one decimal place. + Approach the question asked by assuming the standpoint of an investment banking + analyst who only has access to the statement of income. + + answer: 101.5% + justification: 'The metric total revenue was directly extracted from the company + 10K. The line item name, as seen in the 10K, was: Total net revenue. The final + step was to execute the desired percent change calculation on total revenue.' + page(s)-0based: 85 + page(s): '86' + + category: 2-CALC-CHANGE + correctness: >- + the answer contains a calculated Revenue growth percentage value that is over 100.0% + (if the answer is a single number, assume that it is that calculated Revenue growth percentage value) + + evaluator-unreliable: true + + +financebench_id_07661: + sector: Information Technology + + company: Block + period: 2020 + doc-type: 10k + doc: BLOCK_2020_10K + + question-type: metrics-generated + question-reasoning: Information extraction + domain-question-num: '' + question: 'Using the cash flow statement, answer the following question to the best + of your abilities: how much did Block (formerly known as Square) generate in cash + flow from operating activities in FY2020? Answer in USD millions.' 
+ + answer: $382.00 + justification: 'The metric cash from operations was directly extracted from the + company 10K. The line item name, as seen in the 10K, was: Net cash provided by + operating activities.' + page(s)-0based: 89 + page(s): '90' + + category: 0-RETRIEVE + correctness: >- + the answer contains a quantity equivalent to or approximately equal to + 381.6, 381.6 million, 0.3816 billion, + 382, 382 million, 0.382 billion, + 400, 400 million or 0.4 billion + + +financebench_id_10285: + sector: Industrials + + company: Boeing + period: 2018 + doc-type: 10k + doc: BOEING_2018_10K + + question-type: metrics-generated + question-reasoning: Information extraction + domain-question-num: '' + question: 'We need to calculate a financial metric by using information only provided + within the balance sheet. Please answer the following question: what is Boeing''s + year end FY2018 net property, plant, and equipment (in USD millions)?' + + answer: $12645.00 + justification: 'The metric ppne, net was directly extracted from the company 10K. + The line item name, as seen in the 10K, was: Property, plant and equipment, net.' + page(s)-0based: 51 + page(s): '52' + + category: 0-RETRIEVE + correctness: >- + the answer contains a quantity equivalent to or approximately equal to + 12645, 12645 million, 12.645 billion, + 12600, 12600 million or 12.6 billion + + evaluator-unreliable: true + + +financebench_id_00517: + sector: Industrials + + company: Boeing + period: 2022 + doc-type: 10k + doc: BOEING_2022_10K + + question-type: domain-relevant + question-reasoning: Logical reasoning (based on numerical reasoning) + domain-question-num: dg09 + question: Are there any product categories / service categories that represent more + than 20% of Boeing's revenue for FY2022? + + answer: Yes. Boeing has product and service categories that represent more than + 20% of Boeing's revenue for FY2022. 
These categories are Commercial Airplanes + which comprises 39% of total revenue, Defence which comprises 35% of total revenue + and Services which comprises 26% of total revenue. + justification: 'Commercial Airplanes%=Revenues: Commercial Airplanes/Total revenues*100=25,867/66,608*100=39%. + Defence%=Defense, Space & Security/Total revenues*100=23,162/66,608*100=35%. Services%=Global + Services/Total revenues*100=17,611/66,608*100=26%.' + page(s)-0based: 61 + page(s): '62' + + category: 3-CALC-COMPLEX + correctness: |- + the answer mentions at least 1 of following categories: + - Commercial Airplanes; + - Defense/Defence (or fully written "Defense, Space & Security"); and + - Services (or fully written "Global Services") + + evaluator-unreliable: true + + +financebench_id_01091: + sector: Industrials + + company: Boeing + period: 2022 + doc-type: 10k + doc: BOEING_2022_10K + + question-type: domain-relevant + question-reasoning: Information extraction + domain-question-num: dg11 + question: Has Boeing reported any materially important ongoing legal battles from + FY2022? + + answer: Yes. Multiple lawsuits have been filed against Boeing resulting from a 2018 + Lion Air crash and a 2019 Ethiopian Airlines crash. + justification: '' + page(s)-0based: 112 + page(s): '113' + + category: 0-RETRIEVE + correctness: >- + the answer affirms that there have been material lawsuits / legal battles + + evaluator-unreliable: true + + +financebench_id_00678: # note: Gross Income is implicit, with missing label + sector: Industrials + + company: Boeing + period: 2022 + doc-type: 10k + doc: BOEING_2022_10K + + question-type: domain-relevant + question-reasoning: Numerical reasoning OR information extraction + domain-question-num: dg13 + question: Does Boeing have an improving gross margin profile as of FY2022? If gross + margin is not a useful metric for a company like this, then state that and explain + why. + + answer: Yes. 
Boeing has an improving gross margin profile as of FY2022. Gross profit + improved from $3,017 million in FY2021 to $3,502 million in FY2022. Gross margin + % improved from 4.8% in FY2021 to 5.3% in FY2022. + justification: Gross margin%=Gross margin/Total revenues*100=3,502/66,608*100=5.3% + for 2022 and 3,017/62,286*100=4.8% for 2021. + page(s)-0based: 54 + page(s): '55' + + category: 4-CALC-AND-JUDGE + correctness: >- + the answer contains calculated Gross Margin percentage or decimal values for 2021 and 2022, + and concludes that such metric increased + + evaluator-unreliable: true + + +financebench_id_01290: + sector: Industrials + + company: Boeing + period: 2022 + doc-type: 10k + doc: BOEING_2022_10K + + question-type: domain-relevant + question-reasoning: Information extraction OR Logical reasoning + domain-question-num: dg20 + question: Who are the primary customers of Boeing as of FY2022? + + answer: Boeing's primary customers as of FY2022 are a limited number of commercial + airlines and the US government. The US government accounted for 40% of Boeing's + total revenues in FY2022. + justification: '' + page(s)-0based: 7 + page(s): 8, 10, 14 + + category: 0-RETRIEVE + correctness: >- + the answer mentions airlines and government(s) / military(ies) + + evaluator-unreliable: true + + +financebench_id_00464: + sector: Industrials + + company: Boeing + period: 2022 + doc-type: 10k + doc: BOEING_2022_10K + + question-type: novel-generated + question-reasoning: '' + domain-question-num: '' + question: Is Boeing's business subject to cyclicality? + + answer: Yes, Boeing's business is subject to cyclicality due to its exposure to + the airline industry which is a cyclical industry. + justification: A major portion of Boeing's revenue is derived from the sale of aircraft + to commercial airlines. The commercial airlines business is cyclical, and subject + to significant profit swings. 
+ page(s)-0based: 7 + page(s): '8' + + category: 0-RETRIEVE + correctness: >- + the answer affirms that cyclicality is present + + +financebench_id_00494: + sector: Industrials + + company: Boeing + period: 2022 + doc-type: 10k + doc: BOEING_2022_10K + + question-type: novel-generated + question-reasoning: '' + domain-question-num: '' + question: What production rate changes is Boeing forecasting for FY2023? + + answer: Boeing forecasts an increase in the production rates for the 737, 777X and + 787 aircrafts in 2023. + justification: Boeing plans to gradually increase production rates for the 737 and + 787 and to resume production of 777X. + page(s)-0based: 8 + page(s): '9' + + category: 0-RETRIEVE + correctness: >- + the answer mentions increase(s) in production rate(s) + + +financebench_id_00585: # note: correct number signs + sector: Industrials + + company: Boeing + period: 2022 + doc-type: 10k + doc: BOEING_2022_10K + + question-type: novel-generated + question-reasoning: '' + domain-question-num: '' + question: How does Boeing's effective tax rate in FY2022 compare to FY2021? + + answer: Effective tax rate in FY2022 was 0.62%, compared to -14.76% in FY2021. + justification: Effective tax rate=Income tax (expense) benefit/ Loss before income + taxes*100=(31)/(5,022)*100=0.62% in 2022 and 743/(5,033)*100=-14.76%. + page(s)-0based: 54 + page(s): '55' + + category: 3-CALC-COMPLEX + correctness: >- + the answer contains calculated Effective Tax Rate percentage or decimal values for 2021 and 2022, + with one value being negative and the other value being positive + + evaluator-unreliable: true + + +financebench_id_03473: + sector: Consumer Staples + + company: Coca-Cola + period: 2017 + doc-type: 10k + doc: COCACOLA_2017_10K + + question-type: metrics-generated + question-reasoning: Numerical reasoning + domain-question-num: '' + question: 'What is the FY2017 return on assets (ROA) for Coca Cola? 
ROA is defined + as: FY2017 net income / (average total assets between FY2016 and FY2017). Round + your answer to two decimal places. Give a response to the question by relying + on the details shown in the balance sheet and the P&L statement.' + + answer: '0.01' + justification: 'The metric in question was calculated using other simpler metrics. + The various simpler metrics (from the current and, if relevant, previous fiscal + year(s)) used were: + + + Metric 1: Net income. This metric was located in the 10K as a single line item + named: NET INCOME ATTRIBUTABLE TO SHAREOWNERS OF THE COCA-COLA COMPANY. + + + Metric 2: Total assets. This metric was located in the 10K as a single line item + named: TOTAL ASSETS.' + page(s)-0based: 73 + page(s): 74,76 + + category: 3-CALC-COMPLEX + correctness: >- + the answer contains a calculated Return on Assets (RoA) percentage value that is in the range from 0.90% to 2.00%, + or, alternatively, a calculated decimal value that is in the range from 0.0090 to 0.0200 + (if the answer is a single number, assume that it is that calculated Return on Assets (RoA) metric value) + + evaluator-unreliable: true + + +financebench_id_09724: + sector: Consumer Staples + + company: Coca-Cola + period: 2021 + doc-type: 10k + doc: COCACOLA_2021_10K + + question-type: metrics-generated + question-reasoning: Numerical reasoning + domain-question-num: '' + question: What is Coca Cola's FY2021 COGS % margin? Calculate what was asked by + utilizing the line items clearly shown in the income statement. + + answer: 39.7% + justification: 'The metric in question was calculated using other simpler metrics. + The various simpler metrics (from the current and, if relevant, previous fiscal + year(s)) used were: + + + Metric 1: Cost of goods sold. This metric was located in the 10K as a single line + item named: Cost of goods sold. + + + Metric 2: Total revenue. This metric was located in the 10K as a single line item + named: Net Operating Revenues.' 
+ page(s)-0based: 61 + page(s): '62' + + category: 3-CALC-COMPLEX + correctness: >- + the answer contains a calculated Cost of Goods Sold (COGS) Margin + percentage value that is in the range from 38.00% to 42.00%, + or, alternatively, a calculated decimal value that is in the range from 0.3800 to 0.4200 + (if the answer is a single number, assume that it is that calculated Cost of Goods Sold (COGS) Margin metric value) + + +financebench_id_06272: + sector: Consumer Staples + + company: Coca-Cola + period: 2022 + doc-type: 10k + doc: COCACOLA_2022_10K + + question-type: metrics-generated + question-reasoning: Numerical reasoning + domain-question-num: '' + question: What is Coca Cola's FY2022 dividend payout ratio (using total cash dividends + paid and net income attributable to shareholders)? Round answer to two decimal + places. Answer the question asked by assuming you only have access to information + clearly displayed in the cash flow statement and the income statement. + + answer: '0.8' + justification: 'The metric in question was calculated using other simpler metrics. + The various simpler metrics (from the current and, if relevant, previous fiscal + year(s)) used were: + + + Metric 1: Total cash dividends paid out. This metric was located in the 10K as + a single line item named: Dividends. + + + Metric 2: Net income. This metric was located in the 10K as a single line item + named: Net Income Attributable to Shareowners of The Coca-Cola Company.' 
+ page(s)-0based: 62 + page(s): 63,66 + + category: 3-CALC-COMPLEX + correctness: >- + the answer contains a calculated Dividend Payout Ratio decimal value that is in the range from 0.7800 to 0.8200, + or, alternatively, a calculated percentage value that is in the range from 78.00% to 82.00% + (if the answer is a single number, assume that it is that calculated Dividend Payout Ratio metric value) + + evaluator-unreliable: true + + +financebench_id_10130: + sector: Information Technology + + company: Corning + period: 2020 + doc-type: 10k + doc: CORNING_2020_10K + + question-type: metrics-generated + question-reasoning: Numerical reasoning + domain-question-num: '' + question: 'Based on the information provided primarily in the balance sheet and + the statement of income, what is FY2020 days payable outstanding (DPO) for Corning? + DPO is defined as: 365 * (average accounts payable between FY2019 and FY2020) + / (FY2020 COGS + change in inventory between FY2019 and FY2020). Round your answer + to two decimal places.' + + answer: '63.86' + justification: 'The metric in question was calculated using other simpler metrics. + The various simpler metrics (from the current and, if relevant, previous fiscal + year(s)) used were: + + + Metric 1: Accounts payable. This metric was located in the 10K as a single line + item named: Accounts payable. + + + Metric 2: Inventories. This metric was located in the 10K as a single line item + named: Inventories, net (Note 6). + + + Metric 3: Cost of goods sold. This metric was located in the 10K as a single line + item named: Cost of sales.' 
+ page(s)-0based: 69 + page(s): 70,72 + + category: 3-CALC-COMPLEX + correctness: >- + the answer contains a calculated Days Payable Outstanding (DPO) decimal value that is in the range from 60.00 to 70.00 + (if the answer is a single number, assume that it is that calculated Days Payable Outstanding (DPO) decimal value) + + +financebench_id_02981: + sector: Information Technology + + company: Corning + period: 2021 + doc-type: 10k + doc: CORNING_2021_10K + + question-type: metrics-generated + question-reasoning: Numerical reasoning + domain-question-num: '' + question: Taking into account the information outlined in the income statement, + what is the FY2019 - FY2021 3 year average unadjusted operating income % margin + for Corning? Answer in units of percents and round to one decimal place. + + answer: 10.3% + justification: 'The metric in question was calculated using other simpler metrics. + The various simpler metrics (from the current and, if relevant, previous fiscal + year(s)) used were: + + + Metric 1: Unadjusted operating income. This metric was located in the 10K as a + single line item named: Operating income. + + + Metric 2: Total revenue. This metric was located in the 10K as a single line item + named: Net sales.' 
+ page(s)-0based: 64 + page(s): '65' + + category: 3-CALC-COMPLEX + correctness: >- + the answer contains a calculated Average Operating Income Margin percentage value that is in the range from 9.00% to 12.00%, + or, alternatively, a calculated decimal value that is in the range from 0.0900 to 0.1200 + (if the answer is a single number, assume that it is that calculated Average Operating Income Margin metric value) + + evaluator-unreliable: true + + +financebench_id_01346: + sector: Information Technology + + company: Corning + period: 2022 + doc-type: 10k + doc: CORNING_2022_10K + + question-type: domain-relevant + question-reasoning: Numerical reasoning + domain-question-num: dg23 + question: How much has the effective tax rate of Corning changed between FY2021 + and FY2022? + + answer: The effective tax rate of Corning has changed from 20% in FY2021 to 23% + in FY 2022. + justification: '' + page(s)-0based: 23 + page(s): '24' + + category: 2-CALC-CHANGE + correctness: >- + the answer says that Effective Tax Rate changed + from approximately 20.2% (or 20%) to approximately 22.9% (or 23%), + and/or that it increased by approximately 2.6, 2.7 or 3 percentage points + (or 2.6%, 2.7%, or 3%) + + evaluator-unreliable: true + + +financebench_id_00005: + sector: Information Technology + + company: Corning + period: 2022 + doc-type: 10k + doc: CORNING_2022_10K + + question-type: domain-relevant + question-reasoning: Numerical reasoning OR Logical reasoning + domain-question-num: dg24 + question: Does Corning have positive working capital based on FY2022 data? If working + capital is not a useful or relevant metric for this company, then please state + that and explain why. + + answer: Yes. Corning had a positive working capital amount of $831 million by FY + 2022 close. This answer considers only operating current assets and current liabilities + that were clearly shown in the balance sheet.
+ justification: 'Trade accounts receivable, net of doubtful accounts+Inventories+Other + current assets-Accounts payable-Other accrued liabilities + + 1721+2904+1157-1804-3147' + page(s)-0based: 59 + page(s): '60' + + category: 3-CALC-COMPLEX + correctness: >- + the answer affirms that Working Capital is/was positive, + proving so by a calculated Working Capital metric value that is positive + + +financebench_id_04209: + sector: Consumer Staples + + company: Costco + period: 2021 + doc-type: 10k + doc: COSTCO_2021_10K + + question-type: metrics-generated + question-reasoning: Information extraction + domain-question-num: '' + question: Using only the information within the balance sheet, how much total assets + did Costco have at the end of FY2021? Answer in USD millions. + + answer: $59268.00 + justification: 'The metric total assets was directly extracted from the company + 10K. The line item name, as seen in the 10K, was: TOTAL ASSETS.' + page(s)-0based: 37 + page(s): '38' + + category: 0-RETRIEVE + correctness: >- + the answer contains a quantity equivalent to or approximately equal to + 59268, 59268 million, 59.268 billion, + 59300, 59300 million, 59.3 billion, + 59000, 59000 million or 59 billion + + +financebench_id_05915: + sector: Health Care + + company: CVS Health + period: 2018 + doc-type: 10k + doc: CVSHEALTH_2018_10K + + question-type: metrics-generated + question-reasoning: Numerical reasoning + domain-question-num: '' + question: 'What is the FY2018 fixed asset turnover ratio for CVS Health? Fixed asset + turnover ratio is defined as: FY2018 revenue / (average PP&E between FY2017 and + FY2018). Round your answer to two decimal places. Calculate what was asked by + utilizing the line items clearly shown in the P&L statement and the balance sheet.' + + answer: '17.98' + justification: 'The metric in question was calculated using other simpler metrics.
+ The various simpler metrics (from the current and, if relevant, previous fiscal + year(s)) used were: + + + Metric 1: Total revenue. This metric was located in the 10K as a single line item + named: Total revenues. + + + Metric 2: Ppne, net. This metric was located in the 10K as a single line item + named: Property and equipment, net.' + page(s)-0based: 301 + page(s): 302,304 + + category: 3-CALC-COMPLEX + correctness: >- + the answer contains a calculated Fixed Asset Turnover Ratio decimal value that is in the range from 17.00 to 19.00 + (if the answer is a single number, assume that it is that calculated Fixed Asset Turnover Ratio decimal value) + + evaluator-unreliable: true + + +financebench_id_00790: + sector: Health Care + + company: CVS Health + period: 2022 + doc-type: 10k + doc: CVSHEALTH_2022_10K + + question-type: domain-relevant + question-reasoning: Logical reasoning (based on numerical reasoning) + domain-question-num: dg06 + question: Is CVS Health a capital-intensive business based on FY2022 data? + + answer: Yes, CVS Health requires an extensive asset base to operate, which is evident + from its ROA of only 1.82% in 2022 and 3.39% in 2021, though it should be noted + that a significant portion of this asset base is goodwill, and CVS's fixed assets/total + assets ratio is on the lower side of 5.6%.
+ justification: 'Property and equipment, net/Total Assets + + 12873/228275 + + + ROA=Net Income/Total Assets + + 4165/228275 + + 7898/232999' + page(s)-0based: 107 + page(s): 108,110 + + category: 6-OTHER-ADVANCED + correctness: |- + the answer either: + - mentions that a calculated Return on Assets (RoA) metric value is quite low (which suggests capital intensity); or + - mentions that Fixed Assets form only a small proportion of Total Assets (which suggests the reverse) + + evaluator-unreliable: true + + +financebench_id_01107: + sector: Health Care + + company: CVS Health + period: 2022 + doc-type: 10k + doc: CVSHEALTH_2022_10K + + question-type: domain-relevant + question-reasoning: Information extraction + domain-question-num: dg11 + question: Has CVS Health reported any materially important ongoing legal battles + from 2022, 2021 and 2020? + + answer: "Yes, CVS Health has been involved in multiple ongoing legal battles. Some\ + \ notable legal dispute areas for CVS are: (1) usual and customary pricing litigation:\ + \ where it's claimed that CVS's retail pharmacies overcharged for prescription\ + \ drugs; (2) PBM litigation and investigations: where it's claimed that rebate\ + \ agreements between the drug manufacturers and PBMs caused inflated prices for\ + \ certain drug products; and (3) controlled substances litigation: legal matters\ + \ around opioids for which CVS has agreed to pay up to $4.3 billion to claimants\ + \ in remediation and $625 million to attorneys and fees" + justification: '' + page(s)-0based: 172 + page(s): 173,173,174 + + category: 0-RETRIEVE + correctness: >- + the answer affirms that there have been material lawsuits / legal battles + + +financebench_id_01244: + sector: Health Care + + company: CVS Health + period: 2022 + doc-type: 10k + doc: CVSHEALTH_2022_10K + + question-type: domain-relevant + question-reasoning: Information extraction + domain-question-num: dg18 + question: Has CVS Health paid dividends to common
shareholders in Q2 of FY2022? + + answer: Yes, CVS paid a $ 0.55 dividend per share every quarter in FY2022 + justification: '' + page(s)-0based: 67 + page(s): '68' + + category: 0-RETRIEVE + correctness: >- + the answer affirms that dividends have been / were paid + + +financebench_id_00839: + sector: Consumer Discretionary + + company: Foot Locker + period: 2022 + doc-type: 8k + doc: FOOTLOCKER_2022_8K_dated_2022-08-19 + + question-type: novel-generated + question-reasoning: '' + domain-question-num: '' + question: Does Foot Locker's new CEO have previous CEO experience in a similar company + to Footlocker? + + answer: Yes. She was previous CEO of Ulta Beauty which means she had to manage a + large retail company that has brick and mortar + online business. So yes she was + a CEO in a similar company to Foot Locker before this. + justification: '' + page(s)-0based: 1 + page(s): '2' + + category: 0-RETRIEVE + correctness: >- + the answer affirms that Dillon has got experience in relevant and similar organizations and roles + + evaluator-unreliable: true + + +financebench_id_00822: + sector: Consumer Discretionary + + company: Foot Locker + period: 2022 + doc-type: 8k + doc: FOOTLOCKER_2022_8K_dated-2022-05-20 + + question-type: novel-generated + question-reasoning: '' + domain-question-num: '' + question: Were there any board member nominees who had substantially more votes + against joining than the other nominees? + + answer: Yes, his name is Richard A. Johnson + justification: Richard A. Johnson had roughly 16.1 million votes against him joining + whereas the maximum votes against joining among all other candidates was roughly + 6.1 million. 
+ page(s)-0based: 1 + page(s): '2' + + category: 1-COMPARE + correctness: >- + the answer identifies Johnson as receiving many votes against + + evaluator-unreliable: true + + +financebench_id_04103: + sector: Consumer Staples + + company: General Mills + period: 2019 + doc-type: 10k + doc: GENERALMILLS_2019_10K + + question-type: metrics-generated + question-reasoning: Numerical reasoning + domain-question-num: '' + question: 'What is the FY2019 cash conversion cycle (CCC) for General Mills? CCC + is defined as: DIO + DSO - DPO. DIO is defined as: 365 * (average inventory between + FY2018 and FY2019) / (FY2019 COGS). DSO is defined as: 365 * (average accounts + receivable between FY2018 and FY2019) / (FY2019 Revenue). DPO is defined as: 365 + * (average accounts payable between FY2018 and FY2019) / (FY2019 COGS + change + in inventory between FY2018 and FY2019). Round your answer to two decimal places. + Address the question by using the line items and information shown within the + income statement and the balance sheet.' + + answer: '-3.7' + justification: 'The metric in question was calculated using other simpler metrics. + The various simpler metrics (from the current and, if relevant, previous fiscal + year(s)) used were: + + + Metric 1: Accounts payable. This metric was located in the 10K as a single line + item named: Accounts payable. + + + Metric 2: Accounts receivable, net. This metric was located in the 10K as a single + line item named: Receivables. + + + Metric 3: Cost of goods sold. This metric was located in the 10K as a single line + item named: Cost of sales. + + + Metric 4: Total revenue. This metric was located in the 10K as a single line item + named: Net sales. + + + Metric 5: Inventories. This metric was located in the 10K as a single line item + named: Inventories.' 
+ page(s)-0based: 52 + page(s): 53,55 + + category: 3-CALC-COMPLEX + correctness: >- + the answer contains a calculated Cash Conversion Cycle (CCC) metric value + that is NEGATIVE and in the range from -5.00 to -2.00, based on supporting calculated + Days Inventory Outstanding (DIO), Days Sales Outstanding (DSO) and Days Payable Outstanding (DPO) metric values + answer-inadequate: true + + +financebench_id_03471: + sector: Consumer Staples + + company: General Mills + period: 2020 + doc-type: 10k + doc: GENERALMILLS_2020_10K + + question-type: metrics-generated + question-reasoning: Numerical reasoning + domain-question-num: '' + question: By drawing conclusions from the information stated only in the statement + of financial position, what is General Mills's FY2020 working capital ratio? Define + working capital ratio as total current assets divided by total current liabilities. + Round your answer to two decimal places. + + answer: '0.68' + justification: 'The metric in question was calculated using other simpler metrics. + The various simpler metrics (from the current and, if relevant, previous fiscal + year(s)) used were: + + + Metric 1: Total current liabilities. This metric was located in the 10K as a single + line item named: Total current liabilities. + + + Metric 2: Total current assets. This metric was located in the 10K as a single + line item named: Total current assets.'
+ page(s)-0based: 49 + page(s): '50' + + category: 3-CALC-COMPLEX + correctness: >- + the answer contains a calculated Working Capital Ratio decimal value that is in the range from 0.6500 to 0.7000, + or, alternatively, a calculated percentage value that is in the range from 65.00% to 70.00% + (if the answer is a single number, assume that it is that calculated Working Capital Ratio metric value) + + +financebench_id_04854: + sector: Consumer Staples + + company: General Mills + period: 2020 + doc-type: 10k + doc: GENERALMILLS_2020_10K + + question-type: metrics-generated + question-reasoning: Numerical reasoning + domain-question-num: '' + question: 'According to the information provided in the statement of cash flows, + what is the FY2020 free cash flow (FCF) for General Mills? FCF here is defined + as: (cash from operations - capex). Answer in USD millions.' + + answer: $3215.00 + justification: 'The metric in question was calculated using other simpler metrics. + The various simpler metrics (from the current and, if relevant, previous fiscal + year(s)) used were: + + + Metric 1: Cash from operations. This metric was located in the 10K as a single + line item named: Net cash provided by operating activities. + + + Metric 2: Capital expenditures. This metric was located in the 10K as a single + line item named: Purchases of land, buildings, and equipment.' 
+ page(s)-0based: 51 + page(s): '52' + + category: 3-CALC-COMPLEX + correctness: >- + the answer contains a calculated Free Cash Flows (FCF) metric value that is equivalent to or approximately equal to + 3215.4, 3215.4 million, 3.2154 billion, + 3215, 3215 million, 3.215 billion, + 3200, 3200 million or 3.2 billion + (if the answer is a single number, assume that it is that calculated Free Cash Flows (FCF) metric value) + + evaluator-unreliable: true + + +financebench_id_10136: + sector: Consumer Staples + + company: General Mills + period: 2022 + doc-type: 10k + doc: GENERALMILLS_2022_10K + + question-type: metrics-generated + question-reasoning: Numerical reasoning + domain-question-num: '' + question: 'We want to calculate a financial metric. Please help us compute it by + basing your answers off of the cash flow statement and the income statement. Here''s + the question: what is the FY2022 retention ratio (using total cash dividends paid + and net income attributable to shareholders) for General Mills? Round answer to + two decimal places.' + + answer: '0.54' + justification: 'The metric in question was calculated using other simpler metrics. + The various simpler metrics (from the current and, if relevant, previous fiscal + year(s)) used were: + + + Metric 1: Total cash dividends paid out. This metric was located in the 10K as + a single line item named: Dividends paid. + + + Metric 2: Net income. This metric was located in the 10K as a single line item + named: Net earnings attributable to General Mills.' 
+ page(s)-0based: 44 + page(s): 45,49 + + category: 3-CALC-COMPLEX + correctness: >- + the answer contains a calculated Retention Ratio decimal value that is in the range from 0.5000 to 0.6000, + or, alternatively, a calculated percentage value that is in the range from 50.00% to 60.00% + (if the answer is a single number, assume that it is that calculated Retention Ratio metric value) + + +financebench_id_00956: + sector: Health Care + + company: Johnson & Johnson + period: 2022 + doc-type: 10k + doc: JOHNSON_JOHNSON_2022_10K + + question-type: domain-relevant + question-reasoning: Logical reasoning (based on numerical reasoning) + domain-question-num: dg05 + question: Are JnJ's FY2022 financials that of a high growth company? + + answer: No, JnJ's FY2022 financials are not of a high growth company as sales grew + by 1.3% in FY2022. + justification: '' + page(s)-0based: 27 + + category: 0-RETRIEVE + correctness: >- + the answer mentions low/slow Sales Revenue growth + + +financebench_id_00669: + sector: Health Care + + company: Johnson & Johnson + period: 2022 + doc-type: 10k + doc: JOHNSON_JOHNSON_2022_10K + + question-type: domain-relevant + question-reasoning: Logical reasoning (based on numerical reasoning) OR Numerical + reasoning OR Logical reasoning + domain-question-num: dg16 + question: What drove gross margin change as of FY2022 for JnJ? If gross margin is + not a useful metric for a company like this, then please state that and explain + why. + + answer: 'For FY22, JnJ had changes in gross margin due to: One-time COVID-19 vaccine + manufacturing exit related costs, Currency impacts in the Pharmaceutical segment, + Commodity inflation in the MedTech and Consumer Health segments, partially offset + by Supply chain benefits in the Consumer Health segment.' + justification: Gross margin change is equivalent to the increase in cost of products + sold as a percent to sales. 
+ page(s)-0based: 33 + + category: 5-EXPLAIN-FACTORS + correctness: |- + the answer mentions at least 2 of following: + - one-time COVID-19 vaccine manufacturing exit related costs; + - currency impacts in the Pharmaceutical segment; + - commodity inflation in the MedTech and Consumer Health segments; and/or + - supply chain benefits in the Consumer Health segment + + evaluator-unreliable: true + + +financebench_id_00711: + sector: Health Care + + company: Johnson & Johnson + period: 2022 + doc-type: 10k + doc: JOHNSON_JOHNSON_2022_10K + + question-type: domain-relevant + question-reasoning: Numerical reasoning OR Logical reasoning + domain-question-num: dg25 + question: Roughly how many times has JnJ sold its inventory in FY2022? Calculate + inventory turnover ratio for FY2022; if conventional inventory management is not + meaningful for the company then state that and explain why. + + answer: JnJ sold its inventory 2.7 times in FY2022. + justification: Inventory turnover ratio = Cost of products sold/average inventories + = 31,089/((12,483+10,387)/2) = 2.7 + page(s)-0based: 45 + + category: 3-CALC-COMPLEX + correctness: >- + the answer contains a calculated Inventory Turnover Ratio decimal value that is in the range from 2.00 to 3.00 + (if the answer is a single number, assume that it is that calculated Inventory Turnover Ratio decimal value) + + evaluator-unreliable: true + + +financebench_id_00651: # TODO: retrieve growth rates + sector: Health Care + + company: Johnson & Johnson + period: 2022 + doc-type: Earnings + doc: JOHNSON_JOHNSON_2022Q4_EARNINGS + + question-type: novel-generated + question-reasoning: '' + domain-question-num: '' + question: Is growth in JnJ's adjusted EPS expected to accelerate in FY2023? + + answer: No, rate of growth in adjusted EPS is expected to decelerate slightly from + 3.6% in FY2022 to 3.5% in FY2023. + justification: FY2023 adjusted EPS growth of 3.5% is slightly lower than FY2022 + adjusted EPS growth of 3.6%. 
+ page(s)-0based: 0 + + category: 1-COMPARE + correctness: >- + the answer mentions 3.5% and 3.6%, + or, alternatively, concludes that growth is NOT expected to accelerate + + evaluator-unreliable: true + + +financebench_id_01484: + sector: Health Care + + company: Johnson & Johnson + period: 2022 + doc-type: Earnings + doc: JOHNSON_JOHNSON_2022Q4_EARNINGS + + question-type: novel-generated + question-reasoning: '' + domain-question-num: '' + question: How did JnJ's US sales growth compare to international sales growth in + FY2022? + + answer: US sales increased 3.0% vs international sales decline of 0.6%. + justification: '' + page(s)-0based: 1 + + category: 1-COMPARE + correctness: >- + the answer mentions US sales increased and international sales decreased + + evaluator-unreliable: true + + +financebench_id_01488: + sector: Health Care + + company: Johnson & Johnson + period: 2023 + doc-type: 8k + doc: JOHNSON_JOHNSON_2023_8K_dated-2023-08-30 + + question-type: novel-generated + question-reasoning: '' + domain-question-num: '' + question: Which business segment of JnJ will be treated as a discontinued operation + from August 30, 2023 onward? + + answer: The Consumer Health business segment will be treated as a discontinued operation + from August 30, 2023 onward. + justification: '' + page(s)-0based: 3 + + category: 0-RETRIEVE + correctness: >- + the answer identifies Consumer Health as discontinued + + +financebench_id_01490: + sector: Health Care + + company: Johnson & Johnson + period: 2023 + doc-type: 8k + doc: JOHNSON_JOHNSON_2023_8K_dated-2023-08-30 + + question-type: novel-generated + question-reasoning: '' + domain-question-num: '' + question: What is the amount of the gain accruing to JnJ as a result of the separation + of its Consumer Health business segment, as of August 30, 2023? + + answer: JnJ will make a gain of approximately $20 billion from the separation of + its Consumer Health business segment. 
+ justification: '' + page(s)-0based: 3 + + category: 0-RETRIEVE + correctness: >- + the answer mentions 20 billion + + +financebench_id_01491: + sector: Health Care + + company: Johnson & Johnson + period: 2023 + doc-type: 8k + doc: JOHNSON_JOHNSON_2023_8K_dated-2023-08-30 + + question-type: novel-generated + question-reasoning: '' + domain-question-num: '' + question: What is the amount of the cash proceeds that JnJ realised from the separation + of Kenvue (formerly Consumer Health business segment), as of August 30, 2023? + + answer: JnJ realised $13.2 billion in cash proceeds from the separation of Kenvue. + justification: '' + page(s)-0based: 3 + + category: 0-RETRIEVE + correctness: >- + the answer mentions 13.2 billion, or, alternatively, approximately 13 billion + + +financebench_id_01487: + sector: Health Care + + company: Johnson & Johnson + period: 2023 + doc-type: Earnings + doc: JOHNSON_JOHNSON_2023Q2_EARNINGS + + question-type: novel-generated + question-reasoning: '' + domain-question-num: '' + question: Did JnJ's net earnings as a percent of sales increase in Q2 of FY2023 + compared to Q2 of FY2022? + + answer: Yes, net earnings as a percent of sales increased from 20% in Q2 of FY2022 + to 20.1% in Q2 of FY2023. + justification: '' + page(s)-0based: 9 + + category: 1-COMPARE + correctness: >- + the answer mentions 20.0% (or 20%) and 20.1%, or, alternatively, mentions a slight increase + + +financebench_id_00299: + sector: Financials + + company: JPMorgan + period: 2021 + doc-type: 10q + doc: JPMORGAN_2021Q1_10Q + + question-type: novel-generated + question-reasoning: '' + domain-question-num: '' + question: Which of JPM's business segments had the lowest net revenue in 2021 Q1? + + answer: Corporate. Its net revenue was -$473 million. 
+ justification: 14,605 > 12,517 > 4,077 > 2,393 > -473 + page(s)-0based: 18 + page(s): '19' + + category: 1-COMPARE + correctness: >- + the answer identifies Corporate segment as having lowest Net Revenue + + +financebench_id_02119: + sector: Financials + + company: JPMorgan + period: 2021 + doc-type: 10q + doc: JPMORGAN_2021Q1_10Q + + question-type: novel-generated + question-reasoning: '' + domain-question-num: '' + question: If JPM went bankrupted by the end by 2021 Q1 and liquidated all of its + assets to pay its shareholders, how much could each shareholder get? + + answer: They could receive $66.56 per share. + justification: '' + page(s)-0based: 5 + page(s): '6' + + category: 0-RETRIEVE + correctness: >- + the answer contains a quantity that is in the range from 60.00 to 70.00 + + evaluator-unreliable: true + + +financebench_id_00206: + sector: Financials + + company: JPMorgan + period: 2022 + doc-type: 10k + doc: JPMORGAN_2022_10K + + question-type: domain-relevant + question-reasoning: Logical reasoning (based on numerical reasoning) OR Logical + reasoning + domain-question-num: dg03 + question: Are JPM's gross margins historically consistent (not fluctuating more + than roughly 2% each year)? If gross margins are not a relevant metric for a company + like this, then please state that and explain why. + + answer: Since JPM is a financial institution, gross margin is not a relevant metric. 
+ justification: '' + page(s)-0based: 2 + page(s): '3' + + category: 6-OTHER-ADVANCED + correctness: >- + the answer argues that Gross Margin is not a very relevant/useful metric for this business model and/or industry, + or, alternatively, that performance in this business model and/or industry is usually not judged through Gross Margin + + evaluator-unreliable: true + + +financebench_id_00394: + sector: Financials + + company: JPMorgan + period: 2022 + doc-type: 10q + doc: JPMORGAN_2022Q2_10Q + + question-type: novel-generated + question-reasoning: '' + domain-question-num: '' + question: In 2022 Q2, which of JPM's business segments had the highest net income? + + answer: Corporate & Investment Bank. Its net income was $3725 million. + justification: 3725 > 3100 > 1004 > 994 > -174 + page(s)-0based: 20 + page(s): '21' + + category: 1-COMPARE + correctness: >- + the answer identifies Corporate & Investment Bank segment as having highest Net Income + + +financebench_id_02049: + sector: Financials + + company: JPMorgan + period: 2023 + doc-type: 10q + doc: JPMORGAN_2023Q2_10Q + + question-type: novel-generated + question-reasoning: '' + domain-question-num: '' + question: Looking at VaR, did the risk that JPM faced in the second fiscal quarter + of 2023 decrease compared to the same period in the prior year? + + answer: Yes. It decreased. + justification: '' + page(s)-0based: 84 + page(s): '85' + + category: 0-RETRIEVE + correctness: >- + the answer affirms that VaR decreased + + +financebench_id_10499: + sector: Consumer Staples + + company: Kraft Heinz + period: 2019 + doc-type: 10k + doc: KRAFTHEINZ_2019_10K + + question-type: metrics-generated + question-reasoning: Numerical reasoning + domain-question-num: '' + question: 'What is Kraft Heinz''s FY2019 inventory turnover ratio? Inventory turnover + ratio is defined as: (FY2019 COGS) / (average inventory between FY2018 and FY2019). Round your answer to two decimal places.
Please base your judgments on the information + provided primarily in the balance sheet and the P&L statement.' + + answer: '6.25' + justification: 'The metric in question was calculated using other simpler metrics. + The various simpler metrics (from the current and, if relevant, previous fiscal + year(s)) used were: + + + Metric 1: Cost of goods sold. This metric was located in the 10K as a single line + item named: Cost of products sold. + + + Metric 2: Inventories. This metric was located in the 10K as a single line item + named: Inventories.' + page(s)-0based: 49 + page(s): 50,52 + + category: 3-CALC-COMPLEX + correctness: >- + the answer contains a calculated Inventory Turnover Ratio decimal value that is in the range from 6.00 to 6.50 + (if the answer is a single number, assume that it is that calculated Inventory Turnover Ratio decimal value) + + +financebench_id_04412: + sector: Industrials + + company: Lockheed Martin + period: 2020 + doc-type: 10k + doc: LOCKHEEDMARTIN_2020_10K + + question-type: metrics-generated + question-reasoning: Numerical reasoning + domain-question-num: '' + question: 'We need to calculate a reasonable approximation (or exact number if possible) + of a financial metric. Basing your judgment by information plainly provided in + the balance sheet and the P&L statement, what is Lockheed Martin''s FY2020 asset + turnover ratio? Asset turnover ratio is defined as: FY2020 revenue / (average + total assets between FY2019 and FY2020). Round your answer to two decimal places.' + + answer: '1.33' + justification: 'The metric in question was calculated using other simpler metrics. + The various simpler metrics (from the current and, if relevant, previous fiscal + year(s)) used were: + + + Metric 1: Total revenue. This metric was located in the 10K as a single line item + named: Total net sales. + + + Metric 2: Total assets. This metric was located in the 10K as a single line item + named: Total assets.' 
+ page(s)-0based: 66 + page(s): 67,69 + + category: 3-CALC-COMPLEX + correctness: >- + the answer contains a calculated Asset Turnover Ratio decimal value that is in the range from 1.30 to 1.40 + (if the answer is a single number, assume that it is that calculated Asset Turnover Ratio decimal value) + + +financebench_id_03031: + sector: Industrials + + company: Lockheed Martin + period: 2021 + doc-type: 10k + doc: LOCKHEEDMARTIN_2021_10K + + question-type: metrics-generated + question-reasoning: Numerical reasoning + domain-question-num: '' + question: What is Lockheed Martin's FY2021 net working capital? Define net working + capital as total current assets less total current liabilities. Answer in USD + millions. Respond to the question by assuming the perspective of an investment + analyst who can only use the details shown within the balance sheet. + + answer: $5818.00 + justification: 'The metric in question was calculated using other simpler metrics. + The various simpler metrics (from the current and, if relevant, previous fiscal + year(s)) used were: + + + Metric 1: Total current liabilities. This metric was located in the 10K as a single + line item named: Total current liabilities. + + + Metric 2: Total current assets. This metric was located in the 10K as a single + line item named: Total current assets.' 
+ page(s)-0based: 67 + page(s): '68' + + category: 3-CALC-COMPLEX + correctness: >- + the answer contains a calculated Net Working Capital metric value that is equivalent to or approximately equal to + 5818, 5818 million, 5.818 billion, + 5800, 5800 million or 5.8 billion + (if the answer is a single number, assume that it is that calculated Net Working Capital metric value) + + +financebench_id_03718: + sector: Industrials + + company: Lockheed Martin + period: 2022 + doc-type: 10k + doc: LOCKHEEDMARTIN_2022_10K + + question-type: metrics-generated + question-reasoning: Numerical reasoning + domain-question-num: '' + question: What is Lockheed Martin's 2 year total revenue CAGR from FY2020 to FY2022 + (in units of percents and round to one decimal place)? Provide a response to the + question by primarily using the statement of income. + + answer: 0.4% + justification: 'The metric total revenue was directly extracted from the company + 10K. The line item name, as seen in the 10K, was: Total net sales. The final step + was to execute the desired CAGR calculation on total revenue.' + page(s)-0based: 62 + page(s): '63' + + category: 2-CALC-CHANGE + correctness: >- + the answer contains a calculated CAGR percentage value that is in the range from 0.400% to 0.500% + (if the answer is a single number, assume that it is that calculated CAGR percentage value) + + evaluator-unreliable: true + + +financebench_id_04171: + sector: Consumer Discretionary + + company: MGM Resorts + period: 2018 + doc-type: 10k + doc: MGMRESORTS_2018_10K + + question-type: metrics-generated + question-reasoning: Information extraction + domain-question-num: '' + question: Basing your judgments off of the balance sheet, what is the year end FY2018 + amount of accounts payable for MGM Resorts? Answer in USD millions. + + answer: $303.00 + justification: 'The metric accounts payable was directly extracted from the company + 10K. The line item name, as seen in the 10K, was: Accounts payable.' 
+ page(s)-0based: 56 + page(s): '57' + + category: 0-RETRIEVE + correctness: >- + the answer contains a quantity that is equivalent to or approximately equal to + 302.6, 302.6 million, 0.3026 billion, + 303, 303 million, 0.303 billion, + 300, 300 million or 0.3 billion + + evaluator-unreliable: true + + +financebench_id_03849: + sector: Consumer Discretionary + + company: MGM Resorts + period: 2020 + doc-type: 10k + doc: MGMRESORTS_2020_10K + + question-type: metrics-generated + question-reasoning: Numerical reasoning + domain-question-num: '' + question: What is the FY2018 - FY2020 3 year average of capex as a % of revenue + for MGM Resorts? Answer in units of percents and round to one decimal place. Please + utilize information provided primarily within the statement of cash flows and + the statement of income. + + answer: 7.9% + justification: 'The metric in question was calculated using other simpler metrics. + The various simpler metrics (from the current and, if relevant, previous fiscal + year(s)) used were: + + + Metric 1: Capital expenditures. This metric was located in the 10K as a single + line item named: Capital expenditures, net of construction payable. + + + Metric 2: Total revenue. This metric was located in the 10K as a single line item + named: [blank line item referring to total revenue].' 
+ page(s)-0based: 64 + page(s): 65,67 + + category: 3-CALC-COMPLEX + correctness: >- + the answer contains a calculated metric percentage value that is in the range from 7.50% to 8.50%, + or, alternatively, a calculated decimal value that is in the range from 0.0750 to 0.0850 + (if the answer is a single number, assume that it is that calculated metric value) + + +financebench_id_01254: + sector: Consumer Discretionary + + company: MGM Resorts + period: 2022 + doc-type: 10k + doc: MGMRESORTS_2022_10K + + question-type: domain-relevant + question-reasoning: Information extraction + domain-question-num: dg18 + question: Has MGM Resorts paid dividends to common shareholders in FY2022? + + answer: Yes. MGM maintained 0.01$ per share annual dividend through out FY 2022. + justification: '' + page(s)-0based: 31 + page(s): '32' + + category: 0-RETRIEVE + correctness: >- + the answer affirms that dividends have been / were paid + + evaluator-unreliable: true + + +financebench_id_00382: + sector: Consumer Discretionary + + company: MGM Resorts + period: 2022 + doc-type: Earnings + doc: MGMRESORTS_2022Q4_EARNINGS + + question-type: novel-generated + question-reasoning: '' + domain-question-num: '' + question: Which region had the Highest EBITDAR Contribution for MGM during FY2022? + + answer: Las Vegas resorts contributed ~90% of company level EBITDAR during FY2022. + justification: 3142308/3497254 + page(s)-0based: 12 + page(s): '13' + + category: 1-COMPARE + correctness: >- + the answer identifies Las Vegas resorts as having highest EBITDAR + + +financebench_id_01911: + sector: Consumer Discretionary + + company: MGM Resorts + period: 2022 + doc-type: Earnings + doc: MGMRESORTS_2022Q4_EARNINGS + + question-type: novel-generated + question-reasoning: '' + domain-question-num: '' + question: What was MGM's interest coverage ratio using FY2022 Adjusted EBIT as the + numerator and annual Interest Expense as the denominator? 
+ + answer: As adjusted EBIT is negative, coverage ratio is zero + justification: '' + page(s)-0based: 13 + page(s): '14' + + category: 3-CALC-COMPLEX + correctness: >- + the answer contains a calculated Interest Coverage Ratio metric value, + or, alternatively, concludes that Interest Coverage Ratio is zero + answer-inadequate: true + + +financebench_id_01912: + sector: Consumer Discretionary + + company: MGM Resorts + period: 2022 + doc-type: Earnings + doc: MGMRESORTS_2022Q4_EARNINGS + + question-type: novel-generated + question-reasoning: '' + domain-question-num: '' + question: Which region had the worst topline performance for MGM during FY2022? + + answer: MGM China experienced the worst topline performance amongst the other regions + presented. Its revenue declined 44% in FY2022 whereas the other regions presented + increased their revenues. + justification: '' + page(s)-0based: 2 + page(s): 3,4,4 + + category: 1-COMPARE + correctness: >- + the answer identifies MGM China as having worst top-line Revenue performance + + +financebench_id_00407: + sector: Consumer Discretionary + + company: MGM Resorts + period: 2023 + doc-type: 10q + doc: MGMRESORTS_2023Q2_10Q + + question-type: novel-generated + question-reasoning: '' + domain-question-num: '' + question: Which type of debt received the largest investment among the short term + investments for MGM in H1 FY2023? + + answer: the biggest short term investment is in corporate bonds (almost 82% of the + total investment) + justification: 416420/509921 + page(s)-0based: 10 + page(s): '11' + + category: 1-COMPARE + correctness: >- + the answer identifies corporate bonds as having received largest short-term investment + + +financebench_id_04700: + sector: Information Technology + + company: Microsoft + period: 2016 + doc-type: 10k + doc: MICROSOFT_2016_10K + + question-type: metrics-generated + question-reasoning: Information extraction + domain-question-num: '' + question: What is the FY2016 COGS for Microsoft? 
Please state answer in USD millions. + Provide a response to the question by primarily using the statement of income. + + answer: $32780.00 + justification: 'The metric cost of goods sold was directly extracted from the company + 10K. The line item name, as seen in the 10K, was: Total cost of revenue.' + page(s)-0based: 51 + page(s): '52' + + category: 0-RETRIEVE + correctness: >- + the answer contains a quantity that is equivalent to or approximately equal to + 32780, 32780 million, 32.78 billion, + 32800, 32800 million, 32.8 billion, + 33000, 33000 million or 33 billion + + +financebench_id_00552: + sector: Information Technology + + company: Microsoft + period: 2023 + doc-type: 10k + doc: MICROSOFT_2023_10K + + question-type: domain-relevant + question-reasoning: Numerical reasoning + domain-question-num: dg22 + question: Has Microsoft increased its debt on balance sheet between FY2023 and the + FY2022 period? + answer: No. Microsoft decreased its debt by $2.5bn in FY 2023 vs FY 2022. + justification: 'Current portion of long-term debt+Long-term debt + + 5247+41990 + + 2749+47032' + page(s)-0based: 59 + page(s): '60' + + category: 3-CALC-COMPLEX + correctness: >- + the answer contains calculated Total Debt values for 2022 and 2023, and concludes that Total Debt decreased + answer-inadequate: true + + +financebench_id_04458: + sector: Communication Services + + company: Netflix + period: 2015 + doc-type: 10k + doc: NETFLIX_2015_10K + + question-type: metrics-generated + question-reasoning: Numerical reasoning + domain-question-num: '' + question: 'We want to calculate a financial metric. Please help us compute it by + basing your answers off of the statement of income and the statement of cash flows. + Here''s the question: what is the FY2015 unadjusted EBITDA % margin for Netflix? + Calculate unadjusted EBITDA using unadjusted operating income and D&A (from cash + flow statement).'
+ + answer: 5.4% + justification: 'The metric in question was calculated using other simpler metrics. + The various simpler metrics (from the current and, if relevant, previous fiscal + year(s)) used were: + + + Metric 1: Depreciation and amortization. This metric was located in the 10K as + a single line item named: Depreciation and amortization of property, equipment + and intangibles. + + + Metric 2: Unadjusted operating income. This metric was located in the 10K as a + single line item named: Operating income. + + + Metric 3: Total revenue. This metric was located in the 10K as a single line item + named: Revenues.' + page(s)-0based: 39 + page(s): 40,42 + + category: 3-CALC-COMPLEX + correctness: >- + the answer contains a calculated EBITDA Margin percentage value that is in the range from 5.00% to 5.50%, + or, alternatively, a calculated decimal value that is in the range from 0.0500 to 0.0550, + assuming that EBITDA = "Operating Income" + "Depreciation & Amortization of Property, Equipment & Intangibles" + (if the answer is a single number, assume that it is that calculated EBITDA Margin metric value) + + +financebench_id_03282: + sector: Communication Services + + company: Netflix + period: 2017 + doc-type: 10k + doc: NETFLIX_2017_10K + + question-type: metrics-generated + question-reasoning: Information extraction + domain-question-num: '' + question: What is Netflix's year end FY2017 total current liabilities (in USD millions)? + Base your judgments on the information provided primarily in the balance sheet. + + answer: $5466.00 + justification: 'The metric total current liabilities was directly extracted from + the company 10K. The line item name, as seen in the 10K, was: Total current liabilities.' 
+ page(s)-0based: 44 + page(s): '45' + + category: 0-RETRIEVE + correctness: >- + the answer contains a quantity that is equivalent to or approximately equal to + 5466.3, 5466.3 million, 5.4663 billion, + 5466, 5466 million, 5.466 billion, + 5500, 5500 million or 5.5 billion + + evaluator-unreliable: true + + +financebench_id_04302: + sector: Consumer Discretionary + + company: Nike + period: 2018 + doc-type: 10k + doc: NIKE_2018_10K + + question-type: metrics-generated + question-reasoning: Numerical reasoning + domain-question-num: '' + question: We need to calculate a reasonable approximation (or exact number if possible) + of a financial metric. Basing your judgment by information plainly provided in + the statement of income, what is Nike's three year average of cost of goods sold + as a % of revenue from FY2016 to FY2018? Answer in units of percents and round + to one decimal place. + + answer: 55.1% + justification: 'The metric in question was calculated using other simpler metrics. + The various simpler metrics (from the current and, if relevant, previous fiscal + year(s)) used were: + + + Metric 1: Cost of goods sold. This metric was located in the 10K as a single line + item named: Cost of sales. + + + Metric 2: Total revenue. This metric was located in the 10K as a single line item + named: Revenues.' 
+ page(s)-0based: 45 + page(s): '46' + + category: 3-CALC-COMPLEX + correctness: >- + the answer contains a calculated metric percentage value that is in the range from 50.00% to 60.00%, + or, alternatively, a calculated decimal value that is in the range from 0.5000 to 0.6000 + (if the answer is a single number, assume that it is that calculated metric value) + + +financebench_id_03531: + sector: Consumer Discretionary + + company: Nike + period: 2019 + doc-type: 10k + doc: NIKE_2019_10K + + question-type: metrics-generated + question-reasoning: Information extraction + domain-question-num: '' + question: According to the details clearly outlined within the balance sheet, how + much total current assets did Nike have at the end of FY2019? Answer in USD millions. + + answer: $16525.00 + justification: 'The metric total current assets was directly extracted from the + company 10K. The line item name, as seen in the 10K, was: Total current assets.' + page(s)-0based: 53 + page(s): '54' + + category: 0-RETRIEVE + correctness: >- + the answer contains a quantity that is equivalent to or approximately equal to + 16525, 16525 million, 16.525 billion, + 16500, 16500 million or 16.5 billion + + +financebench_id_04080: + sector: Consumer Discretionary + + company: Nike + period: 2021 + doc-type: 10k + doc: NIKE_2021_10K + + question-type: metrics-generated + question-reasoning: Numerical reasoning + domain-question-num: '' + question: 'When primarily referencing the income statement and the statement of + financial position, what is the FY2021 inventory turnover ratio for Nike? Inventory + turnover ratio is defined as: (FY2021 COGS) / (average inventory between FY2020 + and FY2021). Round your answer to two decimal places.' + + answer: '3.46' + justification: 'The metric in question was calculated using other simpler metrics. + The various simpler metrics (from the current and, if relevant, previous fiscal + year(s)) used were: + + + Metric 1: Cost of goods sold. 
This metric was located in the 10K as a single line + item named: Cost of sales. + + + Metric 2: Inventories. This metric was located in the 10K as a single line item + named: Inventories.' + page(s)-0based: 58 + page(s): 59,61 + + category: 3-CALC-COMPLEX + correctness: >- + the answer contains a calculated Inventory Turnover Ratio decimal value that is in the range from 3.00 to 4.00 + (if the answer is a single number, assume that it is that calculated Inventory Turnover Ratio decimal value) + + +financebench_id_01163: + sector: Consumer Discretionary + + company: Nike + period: 2023 + doc-type: 10k + doc: NIKE_2023_10K + + question-type: domain-relevant + question-reasoning: Numerical reasoning + domain-question-num: dg19 + question: Among operations, investing, and financing activities, which brought in + the most (or lost the least) cash flow for Nike in FY2023? + + answer: Among the three, cash flow from operations was the highest for Nike in FY2023. + justification: '' + page(s)-0based: 61 + page(s): '62' + + category: 1-COMPARE + correctness: >- + the answer identifies Operations / Operating Cash Flows as bringing in most cash + + +financebench_id_00080: + sector: Financials + + company: Paypal + period: 2022 + doc-type: 10k + doc: PAYPAL_2022_10K + + question-type: domain-relevant + question-reasoning: Numerical reasoning OR Logical reasoning + domain-question-num: dg24 + question: Does Paypal have positive working capital based on FY2022 data? If working + capital is not a useful or relevant metric for this company, then please state + that and explain why. + + answer: Yes. Paypal has a positive working capital of $ 1.6Bn as of FY2022 end. 
+ justification: 'Accounts receivable, net+Loans and interest receivable, net of allowances + +Funds receivable and customer accounts+Prepaid expenses and other current assets-Accounts + payable-Funds payable and amounts due to customers-Accrued expenses and other + current liabilities -Income taxes payable + + 963+7431+36357+1898-126-40107-4055-813' + page(s)-0based: 60 + page(s): '61' + + category: 3-CALC-COMPLEX + correctness: >- + the answer affirms that Working Capital is/was positive, + proving so by a calculated Working Capital metric value that is positive + + +financebench_id_04980: + sector: Consumer Staples + + company: PepsiCo + period: 2021 + doc-type: 10k + doc: PEPSICO_2021_10K + + question-type: metrics-generated + question-reasoning: Information extraction + domain-question-num: '' + question: What is the FY2021 capital expenditure amount (in USD billions) for PepsiCo? + Respond to the question by assuming the perspective of an investment analyst who + can only use the details shown within the statement of cash flows. + + answer: $4.60 + justification: 'The metric capital expenditures was directly extracted from the + company 10K. The line item name, as seen in the 10K, was: Capital spending.' + page(s)-0based: 62 + page(s): '63' + + category: 0-RETRIEVE + correctness: >- + the answer contains a quantity that is equivalent to or approximately equal to + 4.625, 4.625 billion, 4625 million, + 4.6, 4.6 billion or 4600 million + + +financebench_id_01009: + sector: Consumer Staples + + company: PepsiCo + period: 2022 + doc-type: 10k + doc: PEPSICO_2022_10K + + question-type: domain-relevant + question-reasoning: Information extraction + domain-question-num: dg08 + question: What are the geographies that Pepsico primarily operates in as of FY2022? + + answer: 'As of FY2022, Pepsico primarily operates in the following geographies: + North America, Latin America, Europe, Africa, Middle East, South Asia, Asia Pacific, + Australia, New Zealand and China.' 
+ justification: '' + page(s)-0based: 3 + page(s): 4, 5 + + category: 0-RETRIEVE + correctness: |- + the answer mentions at least 3 of following geographies: + - North America, which includes United States and Canada; + - Latin America (LatAm); + - Europe; + - Africa, Middle East and South Asia (AMESA); and + - Asia Pacific, Australia and New Zealand and China (APAC) + + +financebench_id_00735: + sector: Consumer Staples + + company: PepsiCo + period: 2022 + doc-type: 10k + doc: PEPSICO_2022_10K + + question-type: domain-relevant + question-reasoning: Information extraction + domain-question-num: dg11 + question: Has Pepsico reported any materially important ongoing legal battles from + FY2022 and FY2021? + + answer: No, Pepsico is not involved in material legal battles. + justification: Management believes the final outcome of legal proceedings will not + have a material adverse outcome. + page(s)-0based: 25 + page(s): '26' + + category: 0-RETRIEVE + correctness: >- + the answer says that there have NOT been material lawsuits / legal battles, + or, alternatively, that lawsuits / legal battles are unlikely to have materially adverse outcomes + + evaluator-unreliable: true + + +financebench_id_01328: + sector: Consumer Staples + + company: PepsiCo + period: 2022 + doc-type: 10k + doc: PEPSICO_2022_10K + + question-type: domain-relevant + question-reasoning: Information extraction + domain-question-num: dg21 + question: What is the quantity of restructuring costs directly outlined in Pepsico's + income statements for FY2022? If restructuring costs are not explicitly outlined + then state 0. + + answer: Pepsico's restructuring costs in FY2022 amounted to $411 million . 
+ justification: '' + page(s)-0based: 77 + page(s): '78' + + category: 0-RETRIEVE + correctness: |- + the answer either: + - mentions a quantity that is equivalent to or approximately equal to 411 million; or + - states 0, zero, and/or that restructuring costs are not explicitly reported + answer-inadequate: true + + +financebench_id_03620: + sector: Consumer Staples + + company: PepsiCo + period: 2022 + doc-type: 10k + doc: PEPSICO_2022_10K + + question-type: metrics-generated + question-reasoning: Numerical reasoning + domain-question-num: '' + question: What is the FY2022 unadjusted EBITDA less capex for PepsiCo? Define unadjusted + EBITDA as unadjusted operating income + depreciation and amortization [from cash + flow statement]. Answer in USD millions. Respond to the question by assuming the + perspective of an investment analyst who can only use the details shown within + the statement of cash flows and the income statement. + + answer: $9068.00 + justification: 'The metric in question was calculated using other simpler metrics. + The various simpler metrics (from the current and, if relevant, previous fiscal + year(s)) used were: + + + Metric 1: Depreciation and amortization. This metric was located in the 10K as + a single line item named: Depreciation and amortization. + + + Metric 2: Unadjusted operating income. This metric was located in the 10K as a + single line item named: Operating Profit. + + + Metric 3: Capital expenditures. This metric was located in the 10K as a single + line item named: Capital spending.' 
+ page(s)-0based: 61 + page(s): 62,64 + + category: 3-CALC-COMPLEX + correctness: |- + the answer contains a calculated metric value that is either: + - in the range from 8500 to 9500; + - in the range from 8500 million to 9500 million; + - in the range from 8.5 billion to 9.5 billion; or + - stated as approximately 9000 million or 9 billion + (if the answer is a single number, assume that it is that calculated metric value) + + evaluator-unreliable: true + + +financebench_id_04481: + sector: Consumer Staples + + company: PepsiCo + period: 2022 + doc-type: 10k + doc: PEPSICO_2022_10K + + question-type: metrics-generated + question-reasoning: Numerical reasoning + domain-question-num: '' + question: What is the FY2022 unadjusted EBITDA % margin for PepsiCo? Calculate unadjusted + EBITDA using unadjusted operating income and D&A (from cash flow statement). Give + a response to the question by relying on the details shown in the statement of + cash flows and the P&L statement. + + answer: 16.5% + justification: 'The metric in question was calculated using other simpler metrics. + The various simpler metrics (from the current and, if relevant, previous fiscal + year(s)) used were: + + + Metric 1: Depreciation and amortization. This metric was located in the 10K as + a single line item named: Depreciation and amortization. + + + Metric 2: Unadjusted operating income. This metric was located in the 10K as a + single line item named: Operating Profit. + + + Metric 3: Total revenue. This metric was located in the 10K as a single line item + named: Net Revenue.' 
+ page(s)-0based: 61 + page(s): 62,64 + + category: 3-CALC-COMPLEX + correctness: |- + the answer contains a calculated EBITDA Margin percentage value that is in the range from 16.00% to 17.00%, + or, alternatively, a calculated decimal value that is in the range from 0.1600 to 0.1700 + (if the answer is a single number, assume that it is that calculated EBITDA Margin metric value) + + +financebench_id_01482: + sector: Consumer Staples + + company: PepsiCo + period: 2023 + doc-type: 8k + doc: PEPSICO_2023_8K_dated-2023-05-05 + + question-type: novel-generated + question-reasoning: '' + domain-question-num: '' + question: At the Pepsico AGM held on May 3, 2023, what was the outcome of the shareholder + vote on the shareholder proposal for a congruency report by Pepsico on net-zero + emissions policies? + + answer: The shareholder proposal for a congruency report by Pepsico on net-zero + emissions policies was defeated. + justification: '' + page(s)-0based: 3 + page(s): '4' + + category: 1-COMPARE + correctness: >- + the answer says proposal related to Net-Zero Emissions was defeated / not successful + + +financebench_id_00705: + sector: Consumer Staples + + company: PepsiCo + period: 2023 + doc-type: 8k + doc: PEPSICO_2023_8K_dated-2023-05-30 + + question-type: novel-generated + question-reasoning: '' + domain-question-num: '' + question: By how much did Pepsico increase its unsecured five year revolving credit + agreement on May 26, 2023? + + answer: $400,000,000 increase. 
+ justification: Increase in five year unsecured revolving credit agreement = May + 26, 2023, five year unsecured revolving credit agreement amount of $4,200,000,000 + - May 27, 2022, five year unsecured revolving credit agreement amount of $3,800,000,000 + = $400,000,000 + page(s)-0based: 1 + page(s): '2' + + category: 2-CALC-CHANGE + correctness: >- + the answer contains a calculated change quantity that is equivalent to or approximately equal to + 400,000,000, 400 million or 0.4 billion + (if the answer is a single number, assume that it is that calculated change amount) + + +financebench_id_00882: + sector: Consumer Staples + + company: PepsiCo + period: 2023 + doc-type: 8k + doc: PEPSICO_2023_8K_dated-2023-05-30 + + question-type: novel-generated + question-reasoning: '' + domain-question-num: '' + question: As of May 26, 2023, what is the total amount Pepsico may borrow under + its unsecured revolving credit agreements? + + answer: Total amount Pepsico may borrow under unsecured revolving credit agreements + = $8,400,000,000. + justification: Total amount that may be borrowed under unsecured revolving credit + agreements = 2023, 364 day unsecured revolving credit agreement amount of $4,200,000,000 + + 2023, five year unsecured revolving credit agreement amount of $4,200,000,000 + = $8,400,000,000. 
+ page(s)-0based: 1 + page(s): '2' + + category: 3-CALC-COMPLEX + correctness: |- + the answer either (or both): + - mentions two separate quantities each equal to 4,200,000,000, 4200 million or 4.2 billion; and/or + - contains a calculated total quantity that is greater than or equal to + 8,400,000,000, 8400 million or 8.4 billion + (if the answer is a single number, assume that it is that latter calculated total amount) + + evaluator-unreliable: true + + +financebench_id_01474: + sector: Consumer Staples + + company: PepsiCo + period: 2023 + doc-type: Earnings + doc: PEPSICO_2023Q1_EARNINGS + + question-type: novel-generated + question-reasoning: '' + domain-question-num: '' + question: As of FY2023Q1, why did Pepsico raise full year guidance for FY2023? + + answer: Pepsico experienced a strong start to FY2023. + justification: '' + page(s)-0based: 0 + page(s): '1' + + category: 0-RETRIEVE + correctness: >- + the answer mentions strong business performance + + +financebench_id_01476: + sector: Consumer Staples + + company: PepsiCo + period: 2023 + doc-type: Earnings + doc: PEPSICO_2023Q1_EARNINGS + + question-type: novel-generated + question-reasoning: '' + domain-question-num: '' + question: As of FY2023Q1, by how many percentage points did Pepsico raise full year + guidance in respect of core constant currency EPS growth? + + answer: Pepsico raised full year guidance in respect of core constant currency EPS + growth by 1 percentage point. + justification: '' + page(s)-0based: 0 + page(s): '1' + + category: 2-CALC-CHANGE + correctness: >- + the answer mentions growth guidance raised from 8% to 9%, + and/or growth guidance raised by 1 percentage point or 1% + + evaluator-unreliable: true + + +financebench_id_00302: + sector: Health Care + + company: Pfizer + period: 2021 + doc-type: 10k + doc: PFIZER_2021_10K + + question-type: novel-generated + question-reasoning: '' + domain-question-num: '' + question: Did Pfizer grow its PPNE between FY20 and FY21? 
+ + answer: Yes, change in PPNE was positive year over year + justification: 14882 - 13745 > 0 + page(s)-0based: 58 + page(s): '59' + + category: 1-COMPARE + correctness: >- + the answer concludes that Property, Plant & Equipment (PP&E or PPNE) increased + + evaluator-unreliable: true + + +financebench_id_00702: + sector: Health Care + + company: Pfizer + period: 2021 + doc-type: 10k + doc: PFIZER_2021_10K + + question-type: novel-generated + question-reasoning: '' + domain-question-num: '' + question: Were there any potential events that are not in Pfizer's standard business + operations that substantially increased net income in 2019? + + answer: Yes, the gain on completion of Consumer Healthcare JV Transaction + justification: Income statement shows the gain on completion of Consumer Healthcare + JV transaction occurred in FY19. In FY21, this event did not affect the net income + at all due to the seemingly one time nature of the line item + page(s)-0based: 56 + page(s): '57' + + category: 5-EXPLAIN-FACTORS + correctness: >- + the answer mentions Consumer Healthcare JV transaction + + +financebench_id_02416: # note: Therachon is mentioned on separate following page + sector: Health Care + + company: Pfizer + period: 2021 + doc-type: 10k + doc: PFIZER_2021_10K + + question-type: novel-generated + question-reasoning: '' + domain-question-num: '' + question: What are three main companies acquired by Pfizer mentioned in this 10K + report? + + answer: Trillium, Array, and Therachon + justification: '' + page(s)-0based: 69 + page(s): 70, 71 + + category: 0-RETRIEVE + correctness: >- + the answer mentions Trillium and Array + + +financebench_id_00283: + sector: Health Care + + company: Pfizer + period: 2023 + doc-type: 10q + doc: Pfizer_2023Q2_10Q + + question-type: novel-generated + question-reasoning: '' + domain-question-num: '' + question: How much does Pfizer expect to pay to spin off Upjohn in the future in + USD million?
+ + answer: '77.78' + justification: '10% cost is remaining amount in the future. Calculation: 700/9 is + 10% of the cost remaining' + page(s)-0based: 40 + page(s): '41' + + category: 6-OTHER-ADVANCED + correctness: >- + the answer mentions 700 million and 90% + + evaluator-unreliable: true + + +financebench_id_00724: + sector: Health Care + + company: Pfizer + period: 2023 + doc-type: 10q + doc: Pfizer_2023Q2_10Q + + question-type: novel-generated + question-reasoning: '' + domain-question-num: '' + question: For Pfizer, which geographic region had the biggest drop in Q22023 year + over year revenues (on a percentage basis)? + + answer: Developed Rest of the World + justification: It's plainly stated in table format the year over year revenue changes + for each of the regions + page(s)-0based: 37 + page(s): '38' + + category: 1-COMPARE + correctness: >- + the answer identifies Developed Rest of World as having worst percentage/relative decline + + +financebench_id_02419: # tricky: Upjohn spin-off started in 2020 but would complete in 2023 + sector: Health Care + + company: Pfizer + period: 2023 + doc-type: 10q + doc: Pfizer_2023Q2_10Q + + question-type: novel-generated + question-reasoning: '' + domain-question-num: '' + question: As of Q2'2023, is Pfizer spinning off any large business segments? + + answer: Yes, it's spinning off Upjohn. + justification: '' + page(s)-0based: 40 + page(s): '41' + + category: 0-RETRIEVE + correctness: >- + the answer mentions Upjohn + + evaluator-unreliable: true + + +financebench_id_00746: + sector: Consumer Discretionary + + company: Ulta Beauty + period: 2023 + doc-type: 10k + doc: ULTABEAUTY_2023_10K + + question-type: domain-relevant + question-reasoning: Information extraction + domain-question-num: dg04 + question: Which debt securities are registered to trade on a national securities + exchange under Ulta Beauty's name as of FY2023? 
+ + answer: There are none + justification: No debt securities listed under securities registered pursuant to + Section 12(b) of the Act. + page(s)-0based: 0 + page(s): '1' + + category: 0-RETRIEVE + correctness: >- + the answer concludes that there are no debt securities traded, + or, alternatively, that no such debt securities are explicitly reported + + +financebench_id_00521: + sector: Consumer Discretionary + + company: Ulta Beauty + period: 2023 + doc-type: 10k + doc: ULTABEAUTY_2023_10K + + question-type: domain-relevant + question-reasoning: Information extraction + domain-question-num: dg10 + question: What are major acquisitions that Ulta Beauty has done in FY2023 and FY2022? + + answer: Ulta Beauty did not make any acquisitions in FY2023 and FY2022. + justification: Consolidated statement of cash flows reflects - for Acquisitions, + net of cash acquired in FY2023 and FY2022. + page(s)-0based: 56 + page(s): '57' + + category: 0-RETRIEVE + correctness: >- + the answer concludes that there are no major acquisitions, + or, alternatively, that no such major acquisitions are explicitly reported + + +financebench_id_00601: + sector: Consumer Discretionary + + company: Ulta Beauty + period: 2023 + doc-type: Earnings + doc: ULTABEAUTY_2023Q4_EARNINGS + + question-type: novel-generated + question-reasoning: '' + domain-question-num: '' + question: What drove the reduction in SG&A expense as a percent of net sales in + FY2023? + + answer: Lower marketing expenses and leverage of incentive compensation due to higher + sales. The answer here assumes FY2023 refers to the 12 months ended on January + 28, 2023 (although the company refers to this period as its fiscal 2022). + justification: Fiscal 2022 = FY2023. Fiscal 2021 = FY2022.
+ page(s)-0based: 1 + page(s): '2' + + category: 0-RETRIEVE + correctness: >- + the answer mentions marketing expenses and incentive compensation + answer-inadequate: true + + +financebench_id_00603: + sector: Consumer Discretionary + + company: Ulta Beauty + period: 2023 + doc-type: Earnings + doc: ULTABEAUTY_2023Q4_EARNINGS + + question-type: novel-generated + question-reasoning: '' + domain-question-num: '' + question: What drove the increase in Ulta Beauty's merchandise inventories balance + at end of FY2023? + + answer: Increase in Merchandise inventories balance was driven by the opening of + 47 new stores. The answer here assumes FY2023 refers to the 12 months ended on + January 28, 2023 (although the company refers to this period as its fiscal 2022). + justification: Fiscal 2022 = FY2023. Fiscal 2021 = FY2022. + page(s)-0based: 2 + page(s): '2' + + category: 0-RETRIEVE + correctness: >- + the answer mentions new stores + + +financebench_id_00605: + sector: Consumer Discretionary + + company: Ulta Beauty + period: 2023 + doc-type: Earnings + doc: ULTABEAUTY_2023Q4_EARNINGS + + question-type: novel-generated + question-reasoning: '' + domain-question-num: '' + question: What percent of Ulta Beauty's total spend on stock repurchases for FY + 2023 occurred in Q4 of FY2023? + + answer: 36%. The answer here assumes FY2023 refers to the 12 months ended on January + 28, 2023 (although the company refers to this period as its fiscal 2022). + justification: Fiscal 2022 = FY2023. Fiscal 2021 = FY2022.
Percent spent in Q4 of + FY2023 = Amount spent in Q4 of FY2023/Total amount spent in FY2023*100 =$328.1 + million /$900 million * 100 = 36% + page(s)-0based: 2 + page(s): '3' + + category: 3-CALC-COMPLEX + correctness: >- + the answer contains a calculated percentage value that is in the range from 30% to 40% + (if the answer is a single number, assume that it is that calculated percentage value) + + +financebench_id_00606: # tricky: highly implicit wordings + sector: Consumer Discretionary + + company: Ulta Beauty + period: 2023 + doc-type: Earnings + doc: ULTABEAUTY_2023Q4_EARNINGS + + question-type: novel-generated + question-reasoning: '' + domain-question-num: '' + question: Did Ulta Beauty's wages expense as a percent of net sales increase or + decrease in FY2023? + + answer: Wages expense as a percent of net sales increased in FY2023. The answer + here assumes FY2023 refers to the 12 months ended on January 28, 2023 (although + the company refers to this period as its fiscal 2022). + justification: Fiscal 2022 = FY2023. Fiscal 2021 = FY2022. Store payroll and benefits + = wages. Store payroll and benefits offsets reduction in SG&A percent of net sales + in FY2023. + page(s)-0based: 1 + page(s): '2' + + category: 6-OTHER-ADVANCED + correctness: >- + the answer concludes that Wages as percent of Net Sales increased + + +financebench_id_00859: + sector: Communication Services + + company: Verizon + period: 2021 + doc-type: 10k + doc: VERIZON_2021_10K + + question-type: novel-generated + question-reasoning: '' + domain-question-num: '' + question: Among all of the derivative instruments that Verizon used to manage the + exposure to fluctuations of foreign currencies exchange rates or interest rates, + which one had the highest notional value in FY 2021? + + answer: Cross currency swaps. Its notional value was $32,502 million.
+ justification: The derivative instruments used to manage the exposure were interest + rate swaps, cross currency swaps, forward starting interest rate swaps, and foreign + exchange forwards. 32502 > 19779 > 1000 > 932 + page(s)-0based: 84 + page(s): '85' + + category: 1-COMPARE + correctness: >- + the answer identifies Cross Currency Swaps as having highest notional value + + +financebench_id_02024: + sector: Communication Services + + company: Verizon + period: 2021 + doc-type: 10k + doc: VERIZON_2021_10K + + question-type: novel-generated + question-reasoning: '' + domain-question-num: '' + question: As of FY 2021, how much did Verizon expect to pay for its retirees in + 2024? + + answer: The estimated pension benefits were $1097 million, and the estimated health + care and life insurance benefits were $862 million. + justification: '' + page(s)-0based: 62 + page(s): 63, 94 + + category: 0-RETRIEVE + correctness: |- + the answer mentions at least 1 of following: + - amount of 1,097 million, or 1.1 billion, or approximately equivalent amount (explicitly or implicitly for "Pension (Benefits)"); + - amount of 862 million, or approximately equivalent amount (explicitly or implicitly for "Health Care & Life (Insurance)"; or + - total amount of 1,959 million, or 1.96 billion, or 2.0 billion, or an approximately equivalent amount + + +financebench_id_00216: + sector: Communication Services + + company: Verizon + period: 2022 + doc-type: 10k + doc: VERIZON_2022_10K + + question-type: domain-relevant + question-reasoning: Logical reasoning (based on numerical reasoning) OR Logical + reasoning + domain-question-num: dg01 + question: Does Verizon have a reasonably healthy liquidity profile based on its + quick ratio for FY 2022? If the quick ratio is not relevant to measure liquidity, + please state that and explain why. + + answer: No. The quick ratio was approximately 0.54 for Verizon. It indicated that + Verizon does not have a healthy liquidity profile.
+ justification: Quick ratio = (current assets - inventories - prepaid expenses) / + current liabilities = (37857 - 2388 - 8358) / 50171 = 0.5403719 + page(s)-0based: 55 + page(s): '56' + + category: 4-CALC-AND-JUDGE + correctness: >- + the answer contains a calculated Quick Ratio decimal value that is in the range from 0.40 to 0.80, + or, alternatively, a calculated percentage value that is in the range from 40% to 80% + + +financebench_id_00215: + sector: Communication Services + + company: Verizon + period: 2022 + doc-type: 10k + doc: VERIZON_2022_10K + + question-type: domain-relevant + question-reasoning: Logical reasoning (based on numerical reasoning) + domain-question-num: dg06 + question: Is Verizon a capital intensive business based on FY 2022 data? + + answer: Yes. Verizon's capital intensity ratio was approximately 2.774729. This + means that it took approximately $2.77 of assets to generate $1 of revenue and + thus, Verizon can be considered capital intensive. + justification: capital intensity ratio = total asset / revenue = 379680/ 136835 + = 2.774729, which is relatively high + page(s)-0based: 55 + page(s): 56, 23 + + category: 4-CALC-AND-JUDGE + correctness: >- + the answer opines that Verizon's business is capital-intensive, and justifies such opinion with a calculated ratio + + evaluator-unreliable: true + + +financebench_id_00566: + sector: Communication Services + + company: Verizon + period: 2022 + doc-type: 10k + doc: VERIZON_2022_10K + + question-type: domain-relevant + question-reasoning: Numerical reasoning + domain-question-num: dg22 + question: Has Verizon increased its debt on balance sheet between 2022 and the 2021 + fiscal period? + + answer: No. Verizon's debt decreased by $229 million. 
+ justification: debt change = debt in 2022 - debt in 2021 = 150639 - 150868 = -229 + page(s)-0based: 76 + page(s): '77' + + category: 1-COMPARE + correctness: >- + the answer concludes that debt decreased + + evaluator-unreliable: true + + +financebench_id_06247: + sector: Consumer Staples + + company: Walmart + period: 2018 + doc-type: 10k + doc: WALMART_2018_10K + + question-type: metrics-generated + question-reasoning: Numerical reasoning + domain-question-num: '' + question: 'What is FY2018 days payable outstanding (DPO) for Walmart? DPO is defined + as: 365 * (average accounts payable between FY2017 and FY2018) / (FY2018 COGS + + change in inventory between FY2017 and FY2018). Round your answer to two decimal + places. Please base your judgments on the information provided primarily in the + statement of financial position and the P&L statement.' + + answer: '42.69' + justification: 'The metric in question was calculated using other simpler metrics. + The various simpler metrics (from the current and, if relevant, previous fiscal + year(s)) used were: + + + Metric 1: Accounts payable. This metric was located in the 10K as a single line + item named: Accounts payable. + + + Metric 2: Inventories. This metric was located in the 10K as a single line item + named: Inventories. + + + Metric 3: Cost of goods sold. This metric was located in the 10K as a single line + item named: Cost of sales.' 
+ page(s)-0based: 56 + page(s): 57,59 + + category: 3-CALC-COMPLEX + correctness: >- + the answer contains a calculated Days Payable Outstanding (DPO) decimal value that is in the range from 35.00 to 50.00 + (if the answer is a single number, assume that it is that calculated Days Payable Outstanding (DPO) decimal value) + + +financebench_id_04784: + sector: Consumer Staples + + company: Walmart + period: 2019 + doc-type: 10k + doc: WALMART_2019_10K + + question-type: metrics-generated + question-reasoning: Numerical reasoning + domain-question-num: '' + question: Based on the information provided primarily in the statement of income, + what is the FY2018 - FY2019 change in unadjusted operating income % margin for + Walmart? Answer in units of percents and round to one decimal place. + + answer: 0.2% + justification: 'The metric in question was calculated using other simpler metrics. + The various simpler metrics (from the current and, if relevant, previous fiscal + year(s)) used were: + + + Metric 1: Unadjusted operating income. This metric was located in the 10K as a + single line item named: Operating income. + + + Metric 2: Total revenue. This metric was located in the 10K as a single line item + named: Total revenues.' 
+ page(s)-0based: 47 + page(s): '48' + + category: 3-CALC-COMPLEX + correctness: |- + the answer contains either: + - calculated Operating Income Margin percentage values for 2018 and 2019, + and their difference, which is a percentage value less than 0.5% in magnitude; or + - calculated Operating Income Margin decimal values for 2018 and 2019, + and their difference, which is a decimal value less than 0.005 in magnitude + answer-inadequate: true + + +financebench_id_06741: + sector: Consumer Staples + + company: Walmart + period: 2020 + doc-type: 10k + doc: WALMART_2020_10K + + question-type: metrics-generated + question-reasoning: Numerical reasoning + domain-question-num: '' + question: What is the FY2018 - FY2020 3 year average unadjusted EBITDA % margin + for Walmart? Define unadjusted EBITDA as unadjusted operating income + depreciation + and amortization from the cash flow statement. Answer in units of percents and + round to one decimal place. Calculate what was asked by utilizing the line items + clearly shown in the P&L statement and the cash flow statement. + + answer: 6.2% + justification: 'The metric in question was calculated using other simpler metrics. + The various simpler metrics (from the current and, if relevant, previous fiscal + year(s)) used were: + + + Metric 1: Depreciation and amortization. This metric was located in the 10K as + a single line item named: Depreciation and amortization. + + + Metric 2: Unadjusted operating income. This metric was located in the 10K as a + single line item named: Operating income. + + + Metric 3: Total revenue. This metric was located in the 10K as a single line item + named: Total revenues.'
+ page(s)-0based: 50 + page(s): 51,56 + + category: 3-CALC-COMPLEX + correctness: >- + the answer contains a calculated EBITDA Margin percentage value that is in the range from 5.50% to 6.50%, + or, alternatively, a calculated decimal value that is in the range from 0.0550 to 0.0650 + (if the answer is a single number, assume that it is that calculated EBITDA Margin metric value) diff --git a/examples/FinanceBench-Lite/knowledge-store.txt b/examples/FinanceBench-Lite/knowledge-store.txt new file mode 100644 index 000000000..e623a859d --- /dev/null +++ b/examples/FinanceBench-Lite/knowledge-store.txt @@ -0,0 +1,45 @@ +Liquidity Metric Formulas +------------------------- + +`(Net) Working Capital` = `(Total) Current Assets` - `(Total) Current Liabilities` + +`Working Capital Ratio` = `(Total) Current Assets` / `(Total) Current Liabilities` + +`Quick Ratio` = ( + (`Cash & Cash Equivalents` + + `Short-Term Investments or (Current) Marketable Securities` + + `(Net) Accounts Receivable, a.k.a. (Net) (Trade) Receivables`) + / `(Total) Current Liabilities` +) + +`Operating Cash Flow Ratio` = ( + `(Net) Cash Flows from Operations, a.k.a. (Net) Operating Cash Flows` + / `(Total) Current Liabilities` +) + +`Free Cash Flow, a.k.a. FCF` = ( + `(Net) Cash Flows from Operations, a.k.a. (Net) Operating Cash Flows` - + `Capital Expenditure(s), a.k.a. CapEx, or Capital Spending, or Property, Plant & Equipment (PP&E) Expenditure(s)/Purchase(s)` +) + +`Free Cash Flow Conversion Ratio` = `Free Cash Flow, a.k.a. FCF` / `Earnings before Interest, Tax, Depreciation & Amortization, a.k.a. EBITDA` + +`Days Inventory Outstanding, a.k.a. DIO` = ( + 365 * `average (Total) (Net) Inventory(ies), typically between two consecutive fiscal year-ends` + / `(Total) Cost of Goods Sold, a.k.a. (Total) COGS, or (Total) Cost of Sales, or (Total) Cost of Revenue` +) + +`Days Payable Outstanding, a.k.a. 
DPO` = ( + 365 * `average Accounts Payable, typically between two consecutive fiscal year-ends` + / (`(Total) Cost of Goods Sold, a.k.a. (Total) COGS, or (Total) Cost of Sales, or (Total) Cost of Revenue` + + `change in (Total) (Net) Inventory(ies), typically between two consecutive fiscal year-ends`) +) + +`Days Sales Outstanding, a.k.a. DSO` = ( + 365 * `average (Net) Accounts Receivable, a.k.a. (Net) (Trade) Receivables, typically between two consecutive fiscal year-ends` + / `(Total) (Net) (Operating) Revenue(s), a.k.a. (Total) (Net) Sales` +) + +`Cash Conversion Cycle, a.k.a. CCC` = ( + `Days Inventory Outstanding, a.k.a. DIO` + `Days Sales Outstanding, a.k.a. DSO` - `Days Payable Outstanding, a.k.a. DPO` +) diff --git a/examples/FinanceBench-Lite/log.py b/examples/FinanceBench-Lite/log.py new file mode 100644 index 000000000..874f12f53 --- /dev/null +++ b/examples/FinanceBench-Lite/log.py @@ -0,0 +1,39 @@ +from pathlib import Path +# import sys + +from loguru import logger + +from data_and_knowledge import FbId, DOC_NAMES_BY_FB_ID + + +LOG_DIR_PATH: Path = Path(__file__).parent / '.log' +CURRENT_LOG_HANDLER_ID: int | None = None + + +# loguru.readthedocs.io/en/stable/api/logger.html#loguru._logger.Logger.add +# logger.add(sink=sys.stdout, level='DEBUG', +# # format=..., +# filter=None, +# colorize=True, +# serialize=False, +# backtrace=True, diagnose=True, +# enqueue=False, context=None, +# catch=True) + + +def switch_log_file(fb_id: FbId, output_name: str): + global CURRENT_LOG_HANDLER_ID # pylint: disable=global-statement + + if CURRENT_LOG_HANDLER_ID is not None: + logger.remove(handler_id=CURRENT_LOG_HANDLER_ID) + + CURRENT_LOG_HANDLER_ID = logger.add(sink=(Path(LOG_DIR_PATH) / + DOC_NAMES_BY_FB_ID[fb_id] / fb_id[16:] / f'{output_name}.log'), + level='DEBUG', + # format=..., + filter=None, + colorize=True, + serialize=False, + backtrace=True, diagnose=True, + enqueue=False, context=None, + catch=True) diff --git
a/examples/FinanceBench-Lite/program-store.yml b/examples/FinanceBench-Lite/program-store.yml new file mode 100644 index 000000000..36e65732c --- /dev/null +++ b/examples/FinanceBench-Lite/program-store.yml @@ -0,0 +1,36 @@ +quick-ratio: + task: Assess liquidity health of {COMPANY} through its `Quick Ratio` as at {PERIOD} fiscal period end + + sub-htps: + - task: |- + Calculate `Quick Ratio` of {COMPANY} as at {PERIOD} fiscal period end as decimal value according to formula: + + `Quick Ratio` = ( + (`Cash & Cash Equivalents` + + `Short-Term Investments or (Current) Marketable Securities` + + `(Net) Accounts Receivable, a.k.a. (Net) (Trade) Receivables`) + / `(Total) Current Liabilities` + ) + + sub-htps: + # 1 single Retrieval task for multiple quantities on same statement, for both efficiency & mutual consistency; + # retrieve individual numerator & denominator balance values only, without taking division + # because RAG LMs may not be good at calculation & mathematical reasoning + - task: |- + What are values in dollars of: + - `Cash & Cash Equivalents`; + - `Short-Term Investments or (Current) Marketable Securities`; + - `(Net) Accounts Receivable, a.k.a. (Net) (Trade) Receivables`; and + - `(Total) Current Liabilities` + (or most similar-meaning reported line items to those) + + on one same `(Consolidated) Balance Sheet, a.k.a. Statement of (Consolidated) Financial Position` + (or most similar-meaning statement) of {COMPANY} + (and NOT Balance Sheets of its acquired and/or divested companies) + + as at {PERIOD} fiscal period end? 
+ + - task: |- + Compare calculated `Quick Ratio` decimal value against 1.00 and make assessment: + - `Quick Ratio` >= 1.00: liquidity is healthy; or + - `Quick Ratio` < 1.00: liquidity is not very healthy diff --git a/examples/FinanceBench-Lite/rag-ground-truths.yml b/examples/FinanceBench-Lite/rag-ground-truths.yml new file mode 100644 index 000000000..6ef352009 --- /dev/null +++ b/examples/FinanceBench-Lite/rag-ground-truths.yml @@ -0,0 +1,914 @@ +defs: + + BS: (Consolidated) Balance Sheet, a.k.a. Statement of (Consolidated) Financial Position + + cash-and-equiv: Cash & Cash Equivalents + st-invest: Short-Term Investments or (Current) Marketable Securities + recvables: (Net) Accounts Receivable, a.k.a. (Net) (Trade) Receivables + invent: (Total) (Net) Inventory(ies) + curr-assets: (Total) Current Assets + fixed-assets: (Net) Fixed Assets, a.k.a. (Net) Property, Plant & Equipment (PP&E) + total-assets: Total Assets + + payables: Accounts Payable + st-debt: Short-Term Debt, or Current Portion of (Long-Term) Debt + curr-liabs: (Total) Current Liabilities + lt-debt: Long-Term Debt (EXCLUDING any current/short-term portion) + + + CF: (Consolidated) Cash Flow(s) Statement(s), a.k.a. (Consolidated) Statement(s) of Cash Flows + + d&a: Depreciation & Amortization, a.k.a. D&A (of Fixed Assets or Property, Plant & Equipment (PP&E)) + op-cf: (Net) Cash Flows from Operations, a.k.a. (Net) Operating Cash Flows + + capex: Capital Expenditure(s), a.k.a. CapEx, or Capital Spending, or Property, Plant & Equipment (PP&E) Expenditure(s)/Purchase(s) + + div: Cash Dividends + + + P&L: >- + (Consolidated) Income Statement, a.k.a. (Consolidated) Profit-and-Loss (P&L) Statement, + or (Consolidated) Earnings Statement, or (Consolidated) Operations Statement + + rev: (Total) (Net) (Operating) Revenue(s), a.k.a. (Total) (Net) Sales + cogs: (Total) Cost of Goods Sold, a.k.a. (Total) COGS, or (Total) Cost of Sales, or (Total) Cost of Revenue + gross: Gross Income, a.k.a. 
Gross Profit, or Gross Earnings (or Loss(es)) + op: (Unadjusted) Operating Income, a.k.a. Operating Profit, or Operating Earnings (or Loss(es)) + ebitda: (Unadjusted) Earnings before Interest, Tax, Depreciation & Amortization, a.k.a. EBITDA + ebit: Earnings before Interest & Tax, a.k.a. EBIT + int: Interest Expense + ebt: Income or Profit or Earnings (or Loss(es)) before (Income) Tax(es) + inc-tax: (Income) Tax Expense + net: Net Income, a.k.a. Net Profit, or Net Earnings (or Loss(es)) (Attributable to Shareholders) + + +ground-truths: + + 3M_2018_10K: + BS: + fixed-assets: + 2018: 8,738 million or 8.7 billion + 2017: 8,866 million or 8.9 billion # unreliable + + + 3M_2022_10K: + BS: + fixed-assets: + 2022: 9,178 million + 2021: 9,429 million + + total-assets: + 2022: 46,455 million + 2021: 47,072 million + + CF: + capex: + 2022: 1,749 million + 2021: 1,603 million + 2020: 1,501 million # unreliable + + P&L: + rev: + 2022: 34,229 million + 2021: 35,355 million + 2020: 32,184 million + + net: + 2022: 5,777 million + 2021: 5,921 million + 2020: 5,449 million + + + 3M_2023Q2_10Q: + BS: + cash-and-equiv: + 2023Q2: 4,258 million # unreliable + 2022: 3,655 million + + st-invest: + 2023Q2: 56 million + 2022: 238 million + + recvables: + 2023Q2: 4,947 million + 2022: 4,532 million + + invent: + 2023Q2: 5,280 million + 2022: 5,372 million + + curr-assets: + 2023Q2: 15,754 million + 2022: 14,688 million + + curr-liabs: + 2023Q2: 10,936 million + 2022: 9,523 million + + + ACTIVISIONBLIZZARD_2019_10K: + BS: + fixed-assets: + 2019: 253 million + 2018: 282 million + + CF: + capex: + 2019: 116 million + 2018: 131 million + 2017: 155 million # unreliable + + P&L: + rev: + 2019: 6,489 million + 2018: 7,500 million + 2017: 7,017 million + + + ADOBE_2015_10K: + BS: + curr-liabs: + 2015: 2,213.556 million or 2,213.6 million or 2.21 billion or 2.2 billion + 2014: 2,494.435 million or 2,494.4 million or 2.49 billion or 2.5 billion + + CF: + op-cf: + 2015: 1,469.502 million or 1,469.5 
million or 1.47 billion or 1.5 billion + 2014: 1,287.482 million or 1,287.5 million or 1.29 billion or 1.3 billion + 2013: 1,151.686 million or 1,151.6 million or 1.15 billion or 1.2 billion + + + ADOBE_2016_10K: + P&L: + op: + 2016: 1,493.602 million or 1,493.6 million or 1.49 billion or 1.5 billion # unreliable + 2015: 903.095 million or 903.1 million or 0.9 billion # unreliable + 2014: 412.685 million or 412.7 million or 0.41 billion or 0.4 billion # unreliable + + + ADOBE_2017_10K: + BS: + curr-liabs: + 2017: 3,527.457 million or 3,527.5 million or 3.53 billion or 3.5 billion + 2016: 2,811.635 million or 2,811.6 million or 2.81 billion or 2.8 billion + + CF: + op-cf: + 2017: 2,912.853 million or 2,912.9 million or 2.91 billion or 2.9 billion + 2016: 2,199.728 million or 2,199.7 million or 2.2 billion + 2013: 1,469.502 million or 1,469.5 million or 1.47 billion or 1.5 billion # unreliable + + + ADOBE_2022_10K: + CF: + op-cf: + 2022: 7,838 million + 2021: 7,230 million + 2020: 5,727 million + + capex: + 2022: 442 million # unreliable + 2021: 348 million # unreliable + 2020: 419 million # unreliable + + P&L: + rev: + 2022: 17,606 million # unreliable + 2021: 15,785 million + 2020: 12,868 million + + op: + 2022: 6,098 million + 2021: 5,802 million + 2020: 4,237 million + + net: + 2022: 4,756 million + 2021: 4,822 million + 2020: 5,260 million + + + AES_2022_10K: + BS: + invent: + 2022: 1,055 million + 2021: 604 million + + total-assets: + 2022: 38,363 million + 2021: 32,963 million + + P&L: + cogs: + 2022: 10,069 million # unreliable + 2021: 8,430 million # unreliable + 2020: 6,967 million # unreliable + + net: + 2022: negative (loss) 546 million + 2021: negative (loss) 409 million # unreliable + 2020: 46 million + + + AMAZON_2017_10K: + BS: + invent: + 2017: 16,047 million + 2016: 11,461 million # unreliable + + payables: + 2017: 34,616 million + 2016: 25,309 million + + P&L: + rev: + 2017: 177,866 million + 2016: 135,987 million + 2015: 107,006 million + + cogs: 
+ 2017: 111,934 million # unreliable: often mistaken for Total Operating Expenses $173,760 million + 2016: 88,265 million # unreliable: often mistaken for Total Operating Expenses $131,801 million + 2015: 71,651 million + + + AMCOR_2020_10K: + BS: + recvables: + 2020: 1,615.9 million # unreliable + 2019: 1,864.3 million # unreliable + + + AMCOR_2023_10K: + BS: + cash-and-equiv: + 2023: 689 million + 2022: 775 million + + st-invest: + 2023: 0 (or not explicitly reported) + 2022: 0 (or not explicitly reported) + + recvables: + 2023: 1,875 million # unreliable + 2022: 1,935 million + + invent: + 2023: 992 million + 1,221 million, or 2,213 million + 2022: 1,114 million + 1,325 million, or 2,439 million + + curr-assets: + 2023: 5,308 million + 2022: 5,853 million + + curr-liabs: + 2023: 4,476 million + 2022: 5,103 million + + P&L: + rev: + 2023: 14,694 million + 2022: 14,544 million + 2021: 12,861 million + + gross: + 2023: 2,725 million + 2022: 2,820 million + 2021: 2,732 million + + + AMCOR_2023Q4_EARNINGS: + P&L: + rev: + 2023Q4: 3,673 million + 2023FY: 14,694 million + 2022Q4: 3,909 million + 2022FY: 14,544 million + + ebitda: + 2023Q4: 540 million # unreliable: FY & Quarter numbers often mistaken for each other + 2023FY: 2,018 million # unreliable: FY & Quarter numbers often mistaken for each other + + + AMD_2015_10K: + CF: + d&a: + 2015: 167 million + 2014: 203 million + 2013: 236 million + + P&L: + rev: + 2015: 3,991 million + 2014: 5,506 million + 2013: 5,299 million + + + AMD_2022_10K: + BS: + cash-and-equiv: + 2022: 4,835 million # unreliable + 2021: 2,535 million # unreliable + + st-invest: + 2022: 1,020 million + 2021: 1,073 million + + recvables: + 2022: 4,126 million # unreliable + 2021: 2,706 million # unreliable + + invent: + 2022: 3,771 million + 2021: 1,955 million # unreliable + + curr-assets: + 2022: 15,019 million + 2021: 8,583 million + + curr-liabs: + 2022: 6,369 million + 2021: 4,240 million + + + AMERICANWATERWORKS_2021_10K: + CF: + d&a: + 2021: 
636 million # unreliable + 2020: 604 million # unreliable + 2019: 582 million # unreliable + + P&L: + op: + 2021: 1,196 million + 2020: 1,248 million + 2019: 1,214 million + + + AMERICANWATERWORKS_2022_10K: + BS: + curr-assets: + 2022: 1,250 million + 2021: 1,554 million + + curr-liabs: + 2022: 2,811 million + 2021: 2,141 million + + + BESTBUY_2017_10K: + P&L: + rev: + 2017: 39,403 million + 2016: 39,528 million + 2015: 40,339 million + + net: + 2017: 1,228 million # unreliable: often mistaken for Net Earnings (Loss) from Continuing Operations $1,207m + 2016: 897 million # unreliable: often mistaken for Net Earnings (Loss) from Continuing Operations $807m + 2015: 1,233 million # unreliable: often mistaken for Net Earnings (Loss) from Continuing Operations $1,246m + + + BESTBUY_2019_10K: + BS: + invent: + 2019: 5,409 million + 2018: 5,209 million + + + BESTBUY_2023_10K: + P&L: + rev: + 2023: 46,298 million or 46.3 billion + 2022: 51,761 million or 51.8 billion + 2021: 47,262 million or 47.3 billion + + gross: + 2023: 9,912 million or 9.9 billion # unreliable + 2022: 11,640 million or 11.6 billion + 2021: 10,573 million or 10.6 billion + + + BLOCK_2016_10K: + BS: + curr-assets: + 2016: 1,001,425 or 1,001.4 million or 1.0 billion + 2015: 705,563 or 705.6 million or 0.7 billion + + curr-liabs: + 2016: 577,464 or 577.5 million or 0.6 billion # unreliable + 2015: 334,202 or 334.2 million or 0.3 billion # unreliable + + + BOEING_2018_10K: + BS: + fixed-assets: + 2018: 12,645 million # unreliable: 2018 & 2017 numbers often mixed up + 2017: 12,672 million # unreliable: 2018 & 2017 numbers often mixed up + + + BOEING_2022_10K: + P&L: + rev: + 2022: 66,608 million + 2021: 62,286 million + 2020: 58,158 million + + gross: + 2022: 3,502 million # unreliable because of missing line-time label + 2021: 3,017 million # unreliable because of missing line-time label + 2020: negative (loss) 5,685 million # unreliable because of missing line-time label + + ebt: + 2022: negative (loss) 
5,022 million + 2021: negative (loss) 5,033 million + 2020: negative (loss) 14,476 million + + inc-tax: + 2022: tax of 31 million + 2021: tax benefit of 743 million + 2020: tax benefit of 2,535 million + + + COCACOLA_2017_10K: + BS: + total-assets: + 2017: 36,545 million # unreliable + 2016: 34,010 million # unreliable + + P&L: + net: + 2017: 1,248 million + 2016: 6,527 million + 2015: 7,351 million + + + COCACOLA_2021_10K: + P&L: + rev: + 2021: 38,655 million + 2020: 33,014 million + 2019: 37,266 million + + cogs: + 2021: 15,357 million + 2020: 13,433 million # unreliable + 2019: 14,619 million # unreliable + + + COCACOLA_2022_10K: + CF: + div: + 2022: 7,616 million + 2021: 7,252 million + 2020: 7,047 million + + P&L: + net: + 2022: 9,542 million + 2021: 9,771 million + 2020: 7,747 million + + + CORNING_2020_10K: + BS: + invent: + 2020: 2,438 million + 2019: 2,320 million + + payables: + 2020: 1,174 million # unreliable: often mistaken for Other Accrued Liabilities $2,437m + 2019: 1,587 million # unreliable: often mistaken for Other Accrued Liabilities $1,923m + + P&L: + cogs: + 2020: 7,772 million # unreliable: often failing to be retrieved at all + 2019: 7,468 million # unreliable: often failing to be retrieved at all + 2018: 6,829 million # unreliable: often failing to be retrieved at all + + + CORNING_2021_10K: + P&L: + rev: + 2021: 14,082 million # unreliable + 2020: 11,303 million + 2019: 11,503 million + + op: + 2021: 2,112 million + 2020: 509 million + 2019: 1,306 million + + + CORNING_2022_10K: + BS: + curr-assets: + 2022: 7,453 million + 2021: 7,659 million + + curr-liabs: + 2022: 5,175 million + 2021: 4,806 million + + + CVSHEALTH_2018_10K: + BS: + fixed-assets: + 2018: 11,349 million # unreliable: often failing to be retrieved at all + 2017: 10,292 million # unreliable: often failing to be retrieved at all + + P&L: + rev: + 2018: 194,579 million # unreliable: often mistaken for Pharmacy Services 2018 revenue $134,128m or Retail/LTC 2018 revenue 
$83,989m + 2017: 184,786 million # unreliable: often mistaken for Pharmacy Services 2017 revenue $130,601m + 2016: 177,546 million + + + CVSHEALTH_2022_10K: + BS: + fixed-assets: + 2022: 12,873 million # unreliable + 2021: 12,896 million + + total-assets: + 2022: 228,275 million + 2021: 232,999 million + + CF: + capex: + 2022: 2,727 million or 2.7 billion + 2021: 2,520 million or 2.5 billion + 2020: 2,437 million or 2.4 billion + + P&L: + rev: + 2022: 322,467 million + 2021: 292,111 million + 2020: 268,706 million + + net: + 2022: 4,149 million + 2021: 7,910 million # unreliable + 2020: 7,179 million # unreliable + + + GENERALMILLS_2019_10K: + BS: + recvables: + 2019: 1,679.7 million + 2018: 1,684.2 million # unreliable + + invent: + 2019: 1,559.3 million + 2018: 1,642.2 million # unreliable + + payables: + 2019: 2,854.1 million + 2018: 2,746.2 million # unreliable + + P&L: + rev: + 2019: 16,865.2 million + 2018: 15,740.4 million + 2017: 15,619.8 million + + cogs: + 2019: 11,108.4 million + 2018: 10,304.8 million + 2017: 10,052.0 million + + + GENERALMILLS_2020_10K: + BS: + curr-assets: + 2020: 5,121.3 million + 2019: 4,186.5 million + + curr-liabs: + 2020: 7,491.5 million + 2019: 7,087.1 million + + CF: + op-cf: + 2020: 3,676.2 million + 2019: 2,807.0 million + 2018: 2,841.0 million + + capex: + 2020: 460.8 million + 2019: 537.6 million + 2018: 622.7 million + + + GENERALMILLS_2022_10K: + CF: + div: + 2022: 1,244.5 million + 2021: 1,246.4 million + 2020: 1,195.8 million + + P&L: + net: + 2022: 2,707.3 million # unreliable + 2021: 2,339.8 million # unreliable + 2020: 2,181.2 million # unreliable + + + JOHNSON_JOHNSON_2022_10K: + BS: + invent: + 2022: 12,483 million + 2021: 10,387 million + + P&L: + cogs: + 2022: 31,089 million + 2021: 29,855 million + 2020: 28,427 million + + + KRAFTHEINZ_2019_10K: + BS: + invent: + 2019: 2,721 million + 2018: 2,667 million + + P&L: + cogs: + 2019: 16,830 million + 2018: 17,347 million # unreliable + 2017: 17,043 million + + + 
LOCKHEEDMARTIN_2020_10K: + BS: + total-assets: + 2020: 50,710 million + 2019: 47,528 million + + P&L: + rev: + 2020: 65,398 million + 2019: 59,812 million # unreliable + 2018: 53,762 million + + + LOCKHEEDMARTIN_2021_10K: + BS: + curr-assets: + 2021: 19,815 million + 2020: 19,378 million + + curr-liabs: + 2021: 13,997 million + 2020: 13,933 million + + + LOCKHEEDMARTIN_2022_10K: + P&L: + rev: + 2022: 65,984 million + 2021: 67,044 million + 2020: 65,398 million + + + MGMRESORTS_2018_10K: + BS: + payables: + 2018: 302.578 million or 302.6 million or 0.3 billion + 2017: 255.028 million or 255 million or 0.26 billion or 0.3 billion + + + MGMRESORTS_2020_10K: + CF: + capex: + 2020: 270.579 million or 271 million + 2019: 739.006 million or 739 million # unreliable + 2018: 1,486.843 million or 1,487 million # unreliable + + P&L: + rev: + 2020: 5,162.082 million or 5,162 million + 2019: 12,899.672 million or 12,900 million # unreliable + 2018: 11,763.096 million or 11,763 million + + + # MGMRESORTS_2022Q4_EARNINGS: + # P&L: + # ebit: + # int: + + + MICROSOFT_2016_10K: + P&L: + cogs: + 2016: 32,780 million # unreliable + 2015: 33,038 million # unreliable + 2014: 27,078 million # unreliable + + + MICROSOFT_2023_10K: + BS: + st-debt: + 2023: 5,247 million + 2022: 2,749 million + + lt-debt: + 2023: 41,990 million + 2022: 47,032 million + + + NETFLIX_2015_10K: + CF: + d&a: + 2015: 62.283 million or 62 million # unreliable: often failing to be retrieved at all + 2014: 54.028 million or 54 million # unreliable: often failing to be retrieved at all + 2013: 48.374 million or 48 million # unreliable: often failing to be retrieved at all + + P&L: + rev: + 2015: 6,779.511 million or 6,780 million + 2014: 5,504.656 million or 5,505 million + 2013: 4,374.562 million or 4,375 million + + op: + 2015: 305.826 million or 306 million + 2014: 402.648 million or 403 million + 2013: 228.347 million or 228 million + + + NIKE_2018_10K: + P&L: + rev: + 2018: 36,397 million + 2017: 34,350 million + 
2016: 32,376 million + + cogs: + 2018: 20,441 million + 2017: 19,038 million + 2016: 17,405 million + + + NIKE_2021_10K: + BS: + invent: + 2021: 6,854 million + 2020: 7,367 million + + P&L: + cogs: + 2021: 24,576 million + 2020: 21,162 million # unreliable + 2019: 21,643 million + + + PAYPAL_2022_10K: + BS: + curr-assets: + 2022: 57,517 million + 2021: 52,574 million + + curr-liabs: + 2022: 45,101 million + 2021: 43,029 million + + + PEPSICO_2021_10K: + CF: + capex: + 2021: 4,625 million + 2020: 4,240 million + 2019: 4,232 million + + + PEPSICO_2022_10K: + CF: + d&a: + 2022: 2,763 million # unreliable + 2021: 2,710 million # unreliable + 2020: 2,548 million + + capex: + 2022: 5,207 million + 2021: 4,625 million + 2020: 4,240 million + + P&L: + rev: + 2022: 86,392 million # unreliable + 2021: 79,474 million # unreliable + 2020: 70,372 million # unreliable + + op: + 2022: 11,512 million + 2021: 11,162 million + 2020: 10,080 million + + + PFIZER_2021_10K: + BS: + fixed-assets: + 2021: 14,882 million # unreliable + 2020: 13,745 million # unreliable + + + VERIZON_2022_10K: + BS: + cash-and-equiv: + 2022: 2,605 million + 2021: 2,921 million + + st-invest: + 2022: 0 (or not explicitly reported) + 2021: 0 (or not explicitly reported) + + recvables: + 2022: 24,506 million # unreliable + 2021: 23,846 million # unreliable + + invent: + 2022: 2,388 million + 2021: 3,055 million + + curr-assets: + 2022: 37,857 million + 2021: 36,728 million + + fixed-assets: + 2022: 107,434 million + 2021: 99,696 million + + total-assets: + 2022: 379,680 million + 2021: 366,596 million + + curr-liabs: + 2022: 50,171 million + 2021: 47,160 million + + CF: + capex: + 2022: 23,087 million # unreliable + 2021: 20,286 million # unreliable + 2020: 18,192 million # unreliable + + P&L: + rev: + 2022: 136,835 million + 2021: 133,613 million + 2020: 128,292 million + + net: + 2022: 21,256 million # unreliable + 2021: 22,065 million + 2020: 17,801 million + + + WALMART_2018_10K: + BS: + invent: + 2018: 
43,783 million + 2017: 43,046 million + + payables: + 2018: 46,092 million + 2017: 41,433 million + + P&L: + cogs: + 2018: 373,396 million # unreliable + 2017: 361,256 million # unreliable + 2016: 360,984 million # unreliable + + + WALMART_2019_10K: + P&L: + rev: + 2019: 514,405 million # unreliable + 2018: 500,343 million # unreliable + 2017: 485,873 million + + op: + 2019: 21,957 million + 2018: 20,437 million + 2017: 22,764 million # unreliable + + + WALMART_2020_10K: + CF: + d&a: + 2020: 10,987 million + 2019: 10,678 million + 2018: 10,529 million + + P&L: + rev: + 2020: 523,964 million # unreliable + 2019: 514,405 million # unreliable + 2018: 500,343 million + + op: + 2020: 20,568 million + 2019: 21,957 million + 2018: 20,437 million diff --git a/examples/FinanceBench-Lite/util.py b/examples/FinanceBench-Lite/util.py new file mode 100644 index 000000000..3025beadb --- /dev/null +++ b/examples/FinanceBench-Lite/util.py @@ -0,0 +1,77 @@ +from __future__ import annotations + +from collections.abc import Callable +from dataclasses import dataclass +from functools import wraps +from typing import TYPE_CHECKING + +from loguru import logger +from tqdm import tqdm + +from data_and_knowledge import FbId, Answer, FB_IDS, DOC_NAMES_BY_FB_ID, QS_BY_FB_ID, OUTPUT_FILE_PATH, get_or_create_output_df # noqa: E501 +from eval import eval_correctness, eval_all +from log import switch_log_file + +if TYPE_CHECKING: + from pandas import DataFrame + + +type QAFunc = Callable[[FbId], Answer] + + +@dataclass +class enable_batch_qa_and_eval: # noqa: N801 + output_name: str + + def __call__(self, qa_func: QAFunc) -> QAFunc: + @wraps(wrapped=qa_func) + def decorated_qa_func(fb_id: FbId) -> Answer | None: + if 'all' in fb_id.lower(): + for _fb_id in tqdm(FB_IDS): + # run inferencing and preliminarily evaluate + eval_correctness(fb_id=_fb_id, answer=qa_func(_fb_id), output_name=self.output_name, human=False) + + # rigorously evaluate again, including human evaluation for difficult cases + 
eval_all(output_name=self.output_name, refresh=True) + return None + + if 'from:' in fb_id.lower(): + for _fb_id in tqdm(FB_IDS[FB_IDS.index(fb_id[5:]):]): + # run inferencing and preliminarily evaluate + eval_correctness(fb_id=_fb_id, answer=qa_func(_fb_id), output_name=self.output_name, human=False) + + # rigorously evaluate again, including human evaluation for difficult cases + eval_all(output_name=self.output_name, refresh=True) + return None + + # run inferencing and evaluate + eval_correctness(fb_id=fb_id, answer=(answer := qa_func(fb_id)), output_name=self.output_name, human=True) + return answer + + return decorated_qa_func + + +@dataclass +class log_qa_and_update_output_file: # noqa: N801 + output_name: str + + def __call__(self, qa_func: QAFunc) -> QAFunc: + @wraps(wrapped=qa_func) + def decorated_qa_func(fb_id: FbId) -> Answer: + switch_log_file(fb_id=fb_id, output_name=self.output_name) + + logger.info((question := f'\n{fb_id}\n{DOC_NAMES_BY_FB_ID[fb_id]}:\n{QS_BY_FB_ID[fb_id]}\n') + + '\n... 
solving process starting ...\n', + depth=1) + + logger.info(question + (f'\n{self.output_name.upper()}:\n' + f'{(answer := qa_func(fb_id)).replace('{', '{{').replace('}', '}}')}\n'), + depth=1) + + output_df: DataFrame = get_or_create_output_df() + output_df.loc[fb_id, self.output_name]: str = answer + output_df.to_csv(OUTPUT_FILE_PATH, index=True) + + return answer + + return decorated_qa_func diff --git a/examples/Planning-and-Reasoning.ipynb b/examples/Planning-and-Reasoning.ipynb deleted file mode 100644 index f85d915e9..000000000 --- a/examples/Planning-and-Reasoning.ipynb +++ /dev/null @@ -1,257 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Problem-Solving Agent with Planning, Reasoning & Domain Knowledge: illustrative example using `FinanceBench` financial-analysis dataset\n", - "\n", - "This notebook illustrates the use of `OpenSSA`'s `Agent` and its planning, reasoning & domain knowledge integration capabilities to solve a problem in the financial-analysis domain." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Setups" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from pprint import pprint\n", - "from IPython.display import display, Markdown" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import os\n", - "import sys\n", - "\n", - "if cwd_is_root := ('examples' in os.listdir()):\n", - " sys.path.append('examples')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from pathlib import Path\n", - "from dotenv import load_dotenv\n", - "\n", - "load_dotenv(dotenv_path=Path('examples' if cwd_is_root else '.') / '.env')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Imports of Agent, Planning, Reasoning & Resource classes from `OpenSSA`" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from openssa import (Agent,\n", - " HTP, AutoHTPlanner,\n", - " OodaReasoner,\n", - " FileResource)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Problem to Solve and Knowledge & Resource available for use" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# problem to solve\n", - "PROBLEM = 'Does AMD have a healthy liquidity profile based on FY22 Quick Ratio?'" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# available domain knowledge (stored as string)\n", - "from FinanceBench.data_and_knowledge import EXPERT_KNOWLEDGE as FINANCIAL_KNOWLEDGE" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# available informational resource: AMD's 2022 10K filing\n", - "\n", - "from 
FinanceBench.data_and_knowledge import Doc as FinancialDoc\n", - "\n", - "AMD_2022_10K = FileResource(path=FinancialDoc('AMD_2022_10K').dir_path)\n", - "\n", - "display(Markdown(AMD_2022_10K.overview))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Problem-Solving by Agent with Hierarchical Task Planning (HTP) & OODA Reasoning (OODAR)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "agent = Agent(planner=AutoHTPlanner(max_depth=2, max_subtasks_per_decomp=3),\n", - " reasoner=OodaReasoner(),\n", - " resources={AMD_2022_10K})" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Problem-Solving with Automated Dynamic Planning (default)\n", - "\n", - "Without additional domain knowledge and expert inputs, the `agent` can attempt to solve the stated problem by using its Planner to decompose the problem into a 1-level-deep sub-task plan and execute that plan using its OODA Reasoner.\n", - "\n", - "At any point during the OODA reasoning execution, if a confident answer cannot be established for the concerned sub-task, the `agent` would use the Planner again to decompose that sub-task 1 level further. This recursive decomposition can be done up to the `agent`'s maximum allowed planning depth.\n", - "\n", - "This default solving mechanism provides a baseline that is often acceptable for domains that are popularly known/understood." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "solution_from_auto_plan_dynamically_executed = agent.solve(PROBLEM)\n", - "\n", - "display(Markdown(solution_from_auto_plan_dynamically_executed))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Problem-Solving with Expert-Guided Planning\n", - "\n", - "One way to make the solution highly accurate and reliable is to provide the `agent` with plan from a knowledgeable expert:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "expert_plan = HTP.from_dict(\n", - " {\n", - " 'task': PROBLEM,\n", - " 'sub-plans': [\n", - " {\n", - " 'task': 'calculate Quick Ratio conservatively as (`Cash & Cash Equivalents` + `Accounts Receivable`) / Current Liabilities',\n", - " 'sub-plans': [\n", - " {\n", - " 'task': 'retrieve `Cash & Cash Equivalents`, `Accounts Receivable` & `Current Liabilities` from Balance Sheet'\n", - " },\n", - " ]\n", - " },\n", - " {\n", - " 'task': 'see whether Quick Ratio is healthy, i.e. 
greater than 1'\n", - " },\n", - " ]\n", - " }\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "expert_guided_solution = agent.solve(PROBLEM, plan=expert_plan)\n", - "\n", - "display(Markdown(expert_guided_solution))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Problem-Solving with Domain Knowledge Injection\n", - "\n", - "If expert-guided solution plans are not readily available in your use case, another and sometimes lighter-weight way to achieve consistently good problem-solving outcomes is to give the `agent` access to domain-specific knowledge, so that such knowledge can be used for constructing effective solution plans for problems in the concerned domain, and for reasoning accurately during the execution process:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "agent_with_knowledge = Agent(planner=AutoHTPlanner(max_depth=2, max_subtasks_per_decomp=3),\n", - " reasoner=OodaReasoner(),\n", - " knowledge={FINANCIAL_KNOWLEDGE},\n", - " resources={AMD_2022_10K})" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "solution_from_auto_plan_dynamically_executed_with_knowledge = agent_with_knowledge.solve(PROBLEM, dynamic=False)\n", - "\n", - "display(Markdown(solution_from_auto_plan_dynamically_executed_with_knowledge))" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": ".venv", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.12.3" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} diff --git a/examples/Tutorial.ipynb b/examples/Tutorial.ipynb deleted file mode 100644 index 
04d9e4a85..000000000 --- a/examples/Tutorial.ipynb +++ /dev/null @@ -1,783 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Build an AI Agent with SEC Filing Insights in Just 10 Minutes Using OpenSSA\n", - "--------------" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### In this tutorial, you will learn how to:\n", - "\n", - "1. Build an AI Agent from scratch with Hierachichy Task Planing (HTP) using openSSA\n", - "2. Improve agent's performance by:\n", - " - Incorporating external knowledge source\n", - " - Providing customized plan from the expert\n", - " - Enabling dynamic solving capability\n", - "\n", - "### By the end of this tutorial, you will understand:\n", - "- What is HTP and how it works?\n", - "- How to customize OpenSSA components to solve your complex problem?" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Setups" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let's start by impporting the neccessary dependencies." - ] - }, - { - "cell_type": "code", - "execution_count": 64, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "The autoreload extension is already loaded. 
To reload it, use:\n", - " %reload_ext autoreload\n" - ] - } - ], - "source": [ - "%load_ext autoreload\n", - "%autoreload" - ] - }, - { - "cell_type": "code", - "execution_count": 63, - "metadata": {}, - "outputs": [], - "source": [ - "from pathlib import Path\n", - "from pprint import pprint\n", - "import os\n", - "import sys\n", - "\n", - "from IPython.display import display, Markdown\n", - "from dotenv import load_dotenv\n", - "import yaml\n", - "\n", - "from openssa import Agent, HTP, AutoHTPlanner, OodaReasoner, FileResource\n", - "from openssa.utils.llms import OpenAILLM\n", - "from openssa.l2.task import Task" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Make sure you plave your OpenAI API key in `example/.env`\n", - "\n", - "```\n", - "OPENAI_API_KEY=...\n", - "```\n", - "\n", - "[Where do I find my OpenAI API Key?](https://help.openai.com/en/articles/4936850-where-do-i-find-my-openai-api-key)" - ] - }, - { - "cell_type": "code", - "execution_count": 65, - "metadata": {}, - "outputs": [], - "source": [ - "# make sure we're in the right folder\n", - "if cwd_is_root := ('examples' in os.listdir()):\n", - " sys.path.append('examples')" - ] - }, - { - "cell_type": "code", - "execution_count": 66, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Sanity check if we have the OpenAI API setup: True\n" - ] - } - ], - "source": [ - "print('Sanity check if we have the OpenAI API setup: ', load_dotenv(dotenv_path=Path('examples' if cwd_is_root else '.') / '.env'))" - ] - }, - { - "cell_type": "code", - "execution_count": 41, - "metadata": {}, - "outputs": [], - "source": [ - "# util function to summarize answer\n", - "def summarize_ans(ans, max_tokens=100):\n", - " llm=OpenAILLM()\n", - " response = llm.call(\n", - " messages=[\n", - " {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\n", - " {\"role\": \"user\", \"content\": \"Please summarize the following text into 
1-2 sentences: \" + ans}\n", - " ],\n", - " max_tokens=max_tokens,\n", - " temperature=0.7\n", - " )\n", - " summary = response.choices[0].message.content\n", - " return summary" - ] - }, - { - "cell_type": "code", - "execution_count": 120, - "metadata": {}, - "outputs": [], - "source": [ - "# util function to print results\n", - "import textwrap\n", - "\n", - "def namestr(obj, namespace):\n", - " return [name for name in namespace if namespace[name] is obj]\n", - "\n", - "def print_solution(sol, present_full_answer=False):\n", - " agent_name = namestr(sol, globals())[0].upper().replace('_', ' ')\n", - " # print(agent_name)\n", - " print('PROBLEM: ')\n", - " print('='*80)\n", - " print(PROBLEM, '\\n')\n", - " if GROUND_TRUTH_ANSWER:\n", - " print('GROUND TRUTH ANSWER: ')\n", - " print('='*80)\n", - " print(GROUND_TRUTH_ANSWER, '\\n')\n", - " if present_full_answer:\n", - " print(f'{agent_name} FULL:')\n", - " print('='*80)\n", - " print(textwrap.fill(sol, 80))\n", - " else:\n", - " print(f'{agent_name} SUMMARIZED:')\n", - " print('='*80)\n", - " print(textwrap.fill(summarize_ans(sol), 80))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Data preparation" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We're going to use the FinanceBench dataset to demonstrate. FinanceBench is a dataset to benchmark question answering capability in financial domain.\n", - "\n", - "We have loaded a sample SEC filing for 3M from 2022. 
\n", - "https://github.com/patronus-ai/financebench/blob/main/pdfs/3M_2022_10K.pdf\n", - "\n", - "- Let's look at a sample question: \n", - "\n", - "`Is 3M a capital-intensive business based on FY2022 data`\n", - "\n", - "- The expected answer for this question is:\n", - "\n", - "`No, the company is managing its CAPEX and Fixed Assets pretty efficiently,\n", - " which is evident from below key metrics:\n", - " CAPEX/Revenue Ratio: 5.1%\n", - " Fixed assets/Total Assets: 20%\n", - " Return on Assets= 12.4%`" - ] - }, - { - "cell_type": "code", - "execution_count": 40, - "metadata": {}, - "outputs": [], - "source": [ - "DOC_PATH = 'sample_data/3M_2022_10K/'\n", - "PROBLEM = 'Is 3M a capital-intensive business based on FY2022 data?'\n", - "GROUND_TRUTH_ANSWER ='''\n", - " No, the company is managing its CAPEX and Fixed Assets pretty efficiently,\n", - " which is evident from below key metrics:\n", - " CAPEX/Revenue Ratio: 5.1%\n", - " Fixed assets/Total Assets: 20%\n", - " Return on Assets= 12.4%'''" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now, we'll build an agent from scracth using [OpenSSA](https://www.openssa.org/)." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Build an AI Agent from Scratch Using OpenSSA\n", - "------------" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Base Agent" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let's build our first agent with all default settings. \n", - "\n", - "" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "To build an agent, the first and most basic resource we need is a document. We will learn how to enable hierarchical task planning (HTP) capability and how to customize it's component later. Let's first build a `Base Agent`` with only the document we've prepared in the previous block and see how well it can solve the question. 
" - ] - }, - { - "cell_type": "code", - "execution_count": 73, - "metadata": {}, - "outputs": [], - "source": [ - "# Build a base agent\n", - "base_agent = Agent(planner=None,\n", - " reasoner=OodaReasoner(),\n", - " knowledge=None,\n", - " resources={FileResource(path=DOC_PATH)})\n", - "\n", - "base_agent_answer = base_agent.solve(problem=PROBLEM,\n", - " plan=None,\n", - " dynamic=False)" - ] - }, - { - "cell_type": "code", - "execution_count": 121, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "PROBLEM: \n", - "================================================================================\n", - "Is 3M a capital-intensive business based on FY2022 data? \n", - "\n", - "GROUND TRUTH ANSWER: \n", - "================================================================================\n", - "\n", - " No, the company is managing its CAPEX and Fixed Assets pretty efficiently,\n", - " which is evident from below key metrics:\n", - " CAPEX/Revenue Ratio: 5.1%\n", - " Fixed assets/Total Assets: 20%\n", - " Return on Assets= 12.4% \n", - "\n", - "BASE AGENT ANSWER SUMMARIZED:\n", - "================================================================================\n", - "3M's financial statements for FY2022 show significant capital investments in\n", - "property, plant, and equipment (PP&E), with capital expenditures amounting to\n", - "$1,831 million and total assets reported at $46,455 million. 
The company's focus\n", - "on growth, productivity, and sustainability is reflected in its projected\n", - "capital spending of $1.5 billion to $1.8 billion for 2023, demonstrating a\n", - "commitment to supporting business activities and driving future growth through\n", - "capital investments and strategic resource management practices\n" - ] - } - ], - "source": [ - "print_solution(base_agent_answer)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "In this example, we can see the default answer is not that good. 3M is not a capital intensive business but the agent failed to answer the question correctly." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## How to Add External Knowledge to the Agent" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let's incorporate external knowledge to the base agent. We've prepared a sample expert knowledge in `sample-data/expert-knowledge.txt` file, you can load your own knowledge by replacing the sample file with yours.\n", - "\n", - "" - ] - }, - { - "cell_type": "code", - "execution_count": 96, - "metadata": {}, - "outputs": [], - "source": [ - "with open(file='sample_data/expert-knowledge.txt',\n", - " buffering=-1,\n", - " encoding='utf-8',\n", - " errors='strict',\n", - " newline=None,\n", - " closefd=True,\n", - " opener=None) as f:\n", - " EXPERT_KNOWLEDGE: str = f.read()\n", - "\n", - "EXPERT_KNOWLEDGE_SET = set(EXPERT_KNOWLEDGE.split('\\n\\n'))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "In the added knowledge, we've specified \n", - "\n", - "```\n", - "Capital-Intensiveness / Return-on-Capital Metric Formulas\n", - "---------------------------------------------------------\n", - "\n", - "`Capital Intensity Ratio` = `Total Assets` / `(Total) (Net) (Operating) Revenue(s), a.k.a. (Total) (Net) Sales`\n", - "\n", - "`Return on (Total) Assets, a.k.a. RoA or RoTA` = (\n", - " `Net Income, a.k.a. 
Net Profit, or Net Earnings (or Loss(es)) (Attributable to Shareholders)` /\n", - " `average Total Assets, typically between two consecutive fiscal year-ends`\n", - ")\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let's add the knowledge set to our base agent." - ] - }, - { - "cell_type": "code", - "execution_count": 97, - "metadata": {}, - "outputs": [], - "source": [ - "agent_with_knowledge = Agent(planner=None,\n", - " reasoner=OodaReasoner(),\n", - " knowledge=EXPERT_KNOWLEDGE_SET,\n", - " resources={FileResource(path=DOC_PATH)})\n", - "\n", - "agent_with_knowledge_solution = agent_with_knowledge.solve(problem=PROBLEM,\n", - " plan=None,\n", - " dynamic=False)" - ] - }, - { - "cell_type": "code", - "execution_count": 109, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "PROBLEM: \n", - "================================================================================\n", - "Is 3M a capital-intensive business based on FY2022 data? 
\n", - "\n", - "GROUND TRUTH ANSWER: \n", - "================================================================================\n", - "\n", - " No, the company is managing its CAPEX and Fixed Assets pretty efficiently,\n", - " which is evident from below key metrics:\n", - " CAPEX/Revenue Ratio: 5.1%\n", - " Fixed assets/Total Assets: 20%\n", - " Return on Assets= 12.4% \n", - "\n", - "AGENT WITH KNOWLEDGE SOLUTION SUMMARIZED:\n", - "================================================================================\n", - "Based on the substantial capital expenditures, large asset base, and planned\n", - "future investments in operational infrastructure and capacity enhancement, it is\n", - "reasonable to classify 3M as a capital-intensive business for FY2022.\n" - ] - } - ], - "source": [ - "print_solution(agent_with_knowledge_solution)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Although the final answer is still incorrect, we can see the reasoning behind is getting better when using external resource - the agent can now recognize `assets`` need to be taken into account when looking at capital intensiveness questions." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Get started with HTP by Adding Auto-Plan on top of Knowledge" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We can see the agent is improved with added knowledge. 
Let's enhance it with OpenSSA's HTP feature: `AutoHTPlanner`.\n", - "\n", - "" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "`HTP` is OpenSSA’s default problem-solving task plan structure.\n", - "\n", - "A `HTP` instance is a tree, in which each node can be decomposed into a number of supporting sub-HTPs, each targeting to solve a supporting sub-task.\n", - "\n", - "`HTP` execution involves using a specified Reasoner to work through sub-tasks from the lowest levels and roll up results up to the top level.\n", - "\n", - "There is also a horizontal results-sharing mechanism to enable the execution of a subsequent HTP node to benefit from results from earlier nodes at the same depth level." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "`AutoHTPlanner` is OpenSSA’s default Planner to create and update problem-solving HTPs.\n", - "\n", - "Such a planner has an LM for generating new or updated task HTPs, the complexity of which is controlled by 2 key parameters `max_depth` and `max_subtasks_per_decomp`. 
\n", - "\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "auto_htp_agent_with_knowledge = Agent(planner=AutoHTPlanner(max_depth=2, max_subtasks_per_decomp=4),\n", - " reasoner=OodaReasoner(),\n", - " knowledge=EXPERT_KNOWLEDGE_SET,\n", - " resources={FileResource(path=DOC_PATH)})\n", - "\n", - "auto_htp_agent_with_knowledge_solution = auto_htp_agent_with_knowledge.solve(problem=PROBLEM,\n", - " plan=None,\n", - " dynamic=False)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "You can read the full logs of all the intermediate steps in `logs/auto_htp_agent_with_knowledge_logs.txt`" - ] - }, - { - "cell_type": "code", - "execution_count": 110, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "PROBLEM: \n", - "================================================================================\n", - "Is 3M a capital-intensive business based on FY2022 data? \n", - "\n", - "GROUND TRUTH ANSWER: \n", - "================================================================================\n", - "\n", - " No, the company is managing its CAPEX and Fixed Assets pretty efficiently,\n", - " which is evident from below key metrics:\n", - " CAPEX/Revenue Ratio: 5.1%\n", - " Fixed assets/Total Assets: 20%\n", - " Return on Assets= 12.4% \n", - "\n", - "AUTO HTP AGENT WITH KNOWLEDGE SOLUTION SUMMARIZED:\n", - "================================================================================\n", - "Based on the available FY2022 data, 3M's net property, plant, and equipment\n", - "(PP&E) constitutes 19.75% of its total assets, indicating that it may not be\n", - "highly capital-intensive relative to some industries. 
However, without\n", - "additional information on capital expenditures (CapEx) to sales ratio,\n", - "depreciation and amortization expenses, and return on assets (RoA), a definitive\n", - "assessment of 3M's capital intensity cannot be made.\n" - ] - } - ], - "source": [ - "print_solution(auto_htp_agent_with_knowledge_solution)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We can see when breaking down the task into other sub-tasks, the agent gives more concrete reasons to answer the question: `key financial metrics such as the\n", - "proportion of net fixed assets to total assets, capital expenditure relative to\n", - "total net sales, depreciation and amortization expense as a percentage of total\n", - "net sales, and Return on Assets cannot be calculated without specific financial\n", - "data`. However, the final answer is still incorrect - the agent still fails to answer 3M is not a capital-intensive business." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Let's Upgrade the Agent to Solve the Problem Dynamically" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let's enable another `HTP` component: `Dynamic` solving. 
When a problem is solved dynamically, it would be decomposed further if the sub-tasks are still not solvable.\n", - "\n", - "\n", - "" - ] - }, - { - "cell_type": "code", - "execution_count": 103, - "metadata": {}, - "outputs": [], - "source": [ - "dynamic_auto_htp_agent_with_knowledge = Agent(planner=AutoHTPlanner(max_depth=2, max_subtasks_per_decomp=4),\n", - " reasoner=OodaReasoner(),\n", - " knowledge=EXPERT_KNOWLEDGE_SET,\n", - " resources={FileResource(path=DOC_PATH)})\n", - "\n", - "dynamic_auto_htp_agent_with_knowledge_solution = dynamic_auto_htp_agent_with_knowledge.solve(problem=PROBLEM,\n", - " plan=None,\n", - " dynamic=True)" - ] - }, - { - "cell_type": "code", - "execution_count": 111, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "PROBLEM: \n", - "================================================================================\n", - "Is 3M a capital-intensive business based on FY2022 data? \n", - "\n", - "GROUND TRUTH ANSWER: \n", - "================================================================================\n", - "\n", - " No, the company is managing its CAPEX and Fixed Assets pretty efficiently,\n", - " which is evident from below key metrics:\n", - " CAPEX/Revenue Ratio: 5.1%\n", - " Fixed assets/Total Assets: 20%\n", - " Return on Assets= 12.4% \n", - "\n", - "DYNAMIC AUTO HTP AGENT WITH KNOWLEDGE SOLUTION SUMMARIZED:\n", - "================================================================================\n", - "Based on the FY2022 data provided, 3M is identified as a capital-intensive\n", - "business due to its significant capital expenditures, large total asset base,\n", - "focus on environmental expenditures, and structured asset management practices.\n", - "These factors collectively indicate a substantial investment in physical assets\n", - "and operational capabilities characteristic of capital-intensive businesses.\n" - ] - } - ], - "source": [ - 
"print_solution(dynamic_auto_htp_agent_with_knowledge_solution)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "With the added knowledge, neither solving statistically nore dynamically could help the agent to get to the final answer correctly. Let's customize the most powerful component of `HTP`: the plan." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Incorporating Expert HTP instead of Auto-HTP" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "With OpenSSA, the user can customize the plan instead of depending on the auto-generated plan. Let's add an expert plan on top of our beginning Base Agent to see how it performs. \n", - "\n", - "" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We've prepared a sample expert plan, but please feel free to customize the expert plan yourself." - ] - }, - { - "cell_type": "code", - "execution_count": 112, - "metadata": {}, - "outputs": [], - "source": [ - "variables = {\n", - " 'COMPANY': '3M',\n", - " 'PERIOD': '2022'\n", - "}\n", - "\n", - "with open('sample_data/expert-plan-templates-sample.yml', 'r') as file:\n", - " EXPERT_PLAN_TEMPLATES_CONTENT = file.read()\n", - "EXPERT_PLAN_TEMPLATES_CONTENT = EXPERT_PLAN_TEMPLATES_CONTENT.format(**variables)\n", - "EXPERT_PLAN = yaml.safe_load(EXPERT_PLAN_TEMPLATES_CONTENT)\n", - "\n", - "EXPERT_HTP = HTP(task=Task.from_dict_or_str(EXPERT_PLAN['task']),\n", - " sub_plans=[HTP.from_dict(d) for d in EXPERT_PLAN.get('sub-plans', [])])" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "expert_htp_agent = Agent(planner=AutoHTPlanner(max_depth=2, max_subtasks_per_decomp=4),\n", - " reasoner=OodaReasoner(),\n", - " knowledge=None,\n", - " resources={FileResource(path=DOC_PATH)})\n", - "\n", - "expert_htp_agent_solution = expert_htp_agent.solve(problem=PROBLEM,\n", - " plan=EXPERT_HTP,\n", - " dynamic=False)" - ] - }, 
- { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "You can read the full logs of all the intermediate steps in `logs/expert_htp_agent_logs.txt`" - ] - }, - { - "cell_type": "code", - "execution_count": 114, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "PROBLEM: \n", - "================================================================================\n", - "Is 3M a capital-intensive business based on FY2022 data? \n", - "\n", - "GROUND TRUTH ANSWER: \n", - "================================================================================\n", - "\n", - " No, the company is managing its CAPEX and Fixed Assets pretty efficiently,\n", - " which is evident from below key metrics:\n", - " CAPEX/Revenue Ratio: 5.1%\n", - " Fixed assets/Total Assets: 20%\n", - " Return on Assets= 12.4% \n", - "\n", - "EXPERT HTP AGENT SOLUTION SUMMARIZED:\n", - "================================================================================\n", - "Based on the 2022 fiscal period data, although 3M has a significant investment\n", - "in Net Property, Plant & Equipment and a substantial asset base relative to its\n", - "sales, its Capital Expenditures and Return on Assets metrics do not align with\n", - "typical characteristics of a capital-intensive business. Therefore, 3M does not\n", - "fully exhibit the characteristics of a capital-intensive business according to\n", - "the provided benchmarks.\n" - ] - } - ], - "source": [ - "print_solution(expert_htp_agent_solution)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Yay! By incorporating the expert's plan, we instantly get the correct answer! " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Try It Yourself!" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "So now you've learned how OpenSSA's `HTP` works. 
You can try different combination of knobs that you can turn, including:\n", - "- auto-plan vs expert-plan\n", - "- statistically solving vs dynamically solving\n", - "- external knowledge vs no external knowledge" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Some tips and tricsk:\n", - "- If you want the fastest way to be up and running with HTP with ok-performance: try auto-plan with added knowledge and dynamically solving.\n", - "- If you want a sufficiently good result with least customization and runtime: try adding expert-plan without anything else\n", - "_ If you want the best result: try adding expert-plan with knowledge and dynamically solving!\n" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": ".venv", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.12.2" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -}