Update docs document phrasing and funnel (#718)
* Update mkdocs version

* Update align documentation with argilla SDK 2.0

* Updated naming of basics
Moved CLI to advanced

* Delete unneeded index pages

* Update naming

* Update navigation and content edit

* Update naming of How to guides

* Add popular issue and community page

* Update GITHUB_ACCESS_TOKEN to GH_ACCESS_TOKEN due to protected naming

* Update scoped reqs for token

* Add GH_ACCESS_TOKEN to workflow

* Delete literate nav

* Update jinja templates to hide unrendered navigation

* Update navigation ordering for API reference

* Update docs/sections/how_to_guides/advanced/structured_generation.md

Co-authored-by: Alvaro Bartolome <[email protected]>

* docs: prose in guides (#721)

* docs: make argilla prose talk about argilla

* docs: simplify prose in generator and global steps

* Update LLM page

* Update LLM docs

* Update Pipeline docs

* Avoid using "function"

* Update Step documentation

* Update docs/sections/how_to_guides/basic/step/index.md

Co-authored-by: Alvaro Bartolome <[email protected]>

* Update docs/sections/how_to_guides/basic/step/index.md

Co-authored-by: Alvaro Bartolome <[email protected]>

* Update `Task` page

* Update definition of `GeneratorTask`

* Update Step documentation

* Update advanced documentation

---------

Co-authored-by: davidberenstein1957 <[email protected]>
Co-authored-by: Agus <[email protected]>
Co-authored-by: Alvaro Bartolome <[email protected]>

* Make `GH_ACCESS_TOKEN` optional

* Add `pandas>=2.0` to `docs`

* Fix typo

* Update default signature GeneratorStep

* Update missing `mkdocs_autorefs` within API reference

* Update API page

* Update CHATML_TEMPLATE formatting to avoid autodoc issues

* Add reference to token scopes required

---------

Co-authored-by: Alvaro Bartolome <[email protected]>
Co-authored-by: burtenshaw <[email protected]>
Co-authored-by: Agus <[email protected]>
Co-authored-by: Gabriel Martín Blázquez <[email protected]>
5 people authored Jun 13, 2024
1 parent 2f245c6 commit ee573fb
Showing 62 changed files with 1,063 additions and 859 deletions.
4 changes: 4 additions & 0 deletions .github/workflows/docs.yml
@@ -45,6 +45,10 @@ jobs:
       - run: mike deploy dev --push
         if: github.ref == 'refs/heads/develop'
+        env:
+          GH_ACCESS_TOKEN: ${{ secrets.GH_ACCESS_TOKEN }}

       - run: mike deploy ${{ github.ref_name }} latest --update-aliases --push
         if: startsWith(github.ref, 'refs/tags/')
+        env:
+          GH_ACCESS_TOKEN: ${{ secrets.GH_ACCESS_TOKEN }}
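Review note: the workflow now exposes `GH_ACCESS_TOKEN` to the docs build, and `docs/scripts/gen_popular_issues.py` (added in this commit) reads it with `os.getenv`, returning empty results when it is unset. The snippet below is a minimal sketch of that optional-token pattern. `build_headers` is a hypothetical helper used only for illustration; the committed script builds its headers inline and skips the request entirely when no token is available.

```python
import os
from typing import Dict, Optional


def build_headers(token: Optional[str]) -> Dict[str, str]:
    """Return GitHub API headers, adding Authorization only when a token exists.

    Hypothetical helper for illustration; not part of the commit.
    """
    headers = {"Accept": "application/vnd.github.v3+json"}
    if token:
        headers["Authorization"] = f"token {token}"
    return headers


# os.getenv returns None when the variable is not defined, for example in
# local builds or in forks that cannot read the repository secrets.
token = os.getenv("GH_ACCESS_TOKEN")
headers = build_headers(token)
print("authenticated" if "Authorization" in headers else "anonymous")
```

Treating a missing token as a normal case is what lets the docs build outside CI without failing.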
2 changes: 1 addition & 1 deletion docs/api/cli.md
@@ -1,6 +1,6 @@
 # Command Line Interface (CLI)

-This section contains the API reference for the CLI. For more information on how to use the CLI, see [Tutorial - CLI](../sections/learn/tutorial/cli/index.md).
+This section contains the API reference for the CLI. For more information on how to use the CLI, see [Tutorial - CLI](../sections/how_to_guides/advanced/cli/index.md).

 ## Utility functions for the `distilabel pipeline` sub-commands
6 changes: 6 additions & 0 deletions docs/api/distiset.md
@@ -0,0 +1,6 @@
# Distiset

This section contains the API reference for the Distiset. For more information on how to use the Distiset, see [Tutorial - Distiset](../sections/how_to_guides/advanced/distiset.md).

:::distilabel.distiset.Distiset
:::distilabel.distiset.create_distiset
3 changes: 3 additions & 0 deletions docs/api/llm/cohere.md
@@ -0,0 +1,3 @@
# CohereLLM

::: distilabel.llms.cohere
2 changes: 1 addition & 1 deletion docs/api/llm/index.md
@@ -2,6 +2,6 @@

 This section contains the API reference for the `distilabel` LLMs, both for the [`LLM`][distilabel.llms.LLM] synchronous implementation, and for the [`AsyncLLM`][distilabel.llms.AsyncLLM] asynchronous one.

-For more information and examples on how to use existing LLMs or create custom ones, please refer to [Tutorial - LLM](../../sections/learn/tutorial/llm/index.md).
+For more information and examples on how to use existing LLMs or create custom ones, please refer to [Tutorial - LLM](../../sections/how_to_guides/basic/llm/index.md).

 ::: distilabel.llms.base
2 changes: 1 addition & 1 deletion docs/api/pipeline/index.md
@@ -1,6 +1,6 @@
 # Pipeline

-This section contains the API reference for the `distilabel` pipelines. For an example on how to use the pipelines, see the [Tutorial - Pipeline](../../sections/learn/tutorial/pipeline/index.md).
+This section contains the API reference for the `distilabel` pipelines. For an example on how to use the pipelines, see the [Tutorial - Pipeline](../../sections/how_to_guides/basic/pipeline/index.md).

 ::: distilabel.pipeline.base
 ::: distilabel.pipeline.local
2 changes: 1 addition & 1 deletion docs/api/step/decorator.md
@@ -2,6 +2,6 @@

 This section contains the reference for the `@step` decorator, used to create new [`Step`][distilabel.steps.Step] subclasses without having to manually define the class.

-For more information check the [Tutorial - Step](../../sections/learn/tutorial/step/index.md) page.
+For more information check the [Tutorial - Step](../../sections/how_to_guides/basic/step/index.md) page.

 ::: distilabel.steps.decorator
2 changes: 1 addition & 1 deletion docs/api/step/generator_step.md
@@ -2,6 +2,6 @@

 This section contains the API reference for the [`GeneratorStep`][distilabel.steps.base.GeneratorStep] class.

-For more information and examples on how to use existing generator steps or create custom ones, please refer to [Tutorial - Step - GeneratorStep](../../sections/learn/tutorial/step/generator_step.md).
+For more information and examples on how to use existing generator steps or create custom ones, please refer to [Tutorial - Step - GeneratorStep](../../sections/how_to_guides/basic/step/generator_step.md).

 ::: distilabel.steps.base.GeneratorStep
2 changes: 1 addition & 1 deletion docs/api/step/global_step.md
@@ -2,6 +2,6 @@

 This section contains the API reference for the [`GlobalStep`][distilabel.steps.base.GlobalStep] class.

-For more information and examples on how to use existing global steps or create custom ones, please refer to [Tutorial - Step - GlobalStep](../../sections/learn/tutorial/step/global_step.md).
+For more information and examples on how to use existing global steps or create custom ones, please refer to [Tutorial - Step - GlobalStep](../../sections/how_to_guides/basic/step/global_step.md).

 ::: distilabel.steps.base.GlobalStep
2 changes: 1 addition & 1 deletion docs/api/step/index.md
@@ -2,7 +2,7 @@

 This section contains the API reference for the `distilabel` step, both for the [`_Step`][distilabel.steps.base._Step] base class and the [`Step`][distilabel.steps.Step] class.

-For more information and examples on how to use existing steps or create custom ones, please refer to [Tutorial - Step](../../sections/learn/tutorial/step/index.md).
+For more information and examples on how to use existing steps or create custom ones, please refer to [Tutorial - Step](../../sections/how_to_guides/basic/step/index.md).

 ::: distilabel.steps.base
     options:
2 changes: 1 addition & 1 deletion docs/api/step_gallery/columns.md
@@ -1,6 +1,6 @@
 # Columns

-This section contains the existing steps intended to be used for commong column operations to apply to the batches.
+This section contains the existing steps intended to be used for common column operations to apply to the batches.

 ::: distilabel.steps.combine
 ::: distilabel.steps.expand
1 change: 1 addition & 0 deletions docs/api/step_gallery/extra.md
@@ -1,5 +1,6 @@
# Extra

::: distilabel.steps.generators.data
::: distilabel.steps.deita
::: distilabel.steps.formatting
::: distilabel.steps.typing
7 changes: 7 additions & 0 deletions docs/api/step_gallery/hugging_face.md
@@ -0,0 +1,7 @@
# Hugging Face

This section contains the existing steps integrated with `Hugging Face` so as to easily push the generated datasets to Hugging Face.

::: distilabel.steps.LoadDataFromDisk
::: distilabel.steps.LoadDataFromFileSystem
::: distilabel.steps.LoadDataFromHub
2 changes: 1 addition & 1 deletion docs/api/task/generator_task.md
@@ -2,6 +2,6 @@

 This section contains the API reference for the `distilabel` generator tasks.

-For more information on how the [`GeneratorTask`][distilabel.steps.tasks.GeneratorTask] works and see some examples, check the [Tutorial - Task - GeneratorTask](../../sections/learn/tutorial/task/generator_task.md) page.
+For more information on how the [`GeneratorTask`][distilabel.steps.tasks.GeneratorTask] works and see some examples, check the [Tutorial - Task - GeneratorTask](../../sections/how_to_guides/basic/task/generator_task.md) page.

 ::: distilabel.steps.tasks.base.GeneratorTask
2 changes: 1 addition & 1 deletion docs/api/task/index.md
@@ -2,7 +2,7 @@

 This section contains the API reference for the `distilabel` tasks.

-For more information on how the [`Task`][distilabel.steps.tasks.Task] works and see some examples, check the [Tutorial - Task](../../sections/learn/tutorial/task/index.md) page.
+For more information on how the [`Task`][distilabel.steps.tasks.Task] works and see some examples, check the [Tutorial - Task](../../sections/how_to_guides/basic/task/index.md) page.

 ::: distilabel.steps.tasks.base
     options:
11 changes: 11 additions & 0 deletions docs/api/task/typing.md
@@ -0,0 +1,11 @@
# Task Typing

This section contains typing classes implemented in distilabel.

::: distilabel.steps.tasks.typing.ChatType
    options:
      members:
        - _ChatType
        - ChatType
::: distilabel.steps.tasks.structured_outputs.outlines.StructuredOutputType
::: distilabel.steps.tasks.structured_outputs.instructor.InstructorStructuredOutputType
1 change: 1 addition & 0 deletions docs/api/task_gallery/index.md
@@ -9,3 +9,4 @@ This section contains the existing [`Task`][distilabel.steps.tasks.Task] subclas
 - "!_Task"
 - "!GeneratorTask"
 - "!ChatType"
+- "!typing"
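Review note: the entries in this hunk look like exclusion patterns for the task gallery listing. The sketch below is a simplified model of how such patterns could behave, assuming they belong to a mkdocstrings `filters` option (the key itself is outside the hunk shown) and that each entry is a regular expression searched against the member name, with a leading `!` marking an exclusion. `keep_member` and the sample member names are hypothetical and only serve the illustration.

```python
import re
from typing import List


def keep_member(name: str, filters: List[str]) -> bool:
    """Simplified model: drop a name if any `!pattern` filter matches it."""
    for raw in filters:
        if raw.startswith("!") and re.search(raw[1:], name):
            return False
    return True


filters = ["!_Task", "!GeneratorTask", "!ChatType", "!typing"]
# Hypothetical member names, chosen only to exercise the filters.
members = ["TextGeneration", "GeneratorTask", "ChatType", "typing", "UltraFeedback"]

kept = [m for m in members if keep_member(m, filters)]
print(kept)  # ['TextGeneration', 'UltraFeedback']
```

Under this model, adding `"!typing"` keeps the typing module out of the gallery now that it has its own page at `docs/api/task/typing.md`.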
14 changes: 2 additions & 12 deletions docs/index.md
@@ -58,16 +58,6 @@ Compute is expensive and output quality is important. We help you **focus on dat

 Synthesize and judge data with **latest research papers** while ensuring **flexibility, scalability and fault tolerance**. So you can focus on improving your data and training your models.

-## 🏘️ Community
-
-We are an open-source community-driven project and we love to hear from you. Here are some ways to get involved:
-
-- [Community Meetup](https://lu.ma/embed-checkout/evt-IQtRiSuXZCIW6FB): listen in or present during one of our bi-weekly events.
-
-- [Slack](https://join.slack.com/t/rubrixworkspace/shared_invite/zt-whigkyjn-a3IUJLD7gDbTZ0rKlvcJ5g): get direct support from the community.
-
-- [Roadmap](https://github.com/orgs/argilla-io/projects/10/views/1): plans change but we love to discuss those with our community so feel encouraged to participate.
-
 ## What do people build with Distilabel?

 Distilabel is a tool that can be used to **synthesize data and provide AI feedback**. Our community uses Distilabel to create amazing [datasets](https://huggingface.co/datasets?other=distilabel) and [models](https://huggingface.co/models?other=distilabel), and **we love contributions to open-source** ourselves too.
@@ -113,14 +103,14 @@ Then run:
 ```python
 from distilabel.llms import OpenAILLM
 from distilabel.pipeline import Pipeline
-from distilabel.steps import LoadHubDataset
+from distilabel.steps import LoadDataFromHub
 from distilabel.steps.tasks import TextGeneration

 with Pipeline(
     name="simple-text-generation-pipeline",
     description="A simple text generation pipeline",
 ) as pipeline:
-    load_dataset = LoadHubDataset(output_mappings={"prompt": "instruction"})
+    load_dataset = LoadDataFromHub(output_mappings={"prompt": "instruction"})

     generate_with_openai = TextGeneration(llm=OpenAILLM(model="gpt-3.5-turbo"))
180 changes: 180 additions & 0 deletions docs/scripts/gen_popular_issues.py
@@ -0,0 +1,180 @@
# Copyright 2023-present, Argilla, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
from datetime import datetime
from typing import List, Union

import pandas as pd
import requests
import mkdocs_gen_files


REPOSITORY = "argilla-io/distilabel"
DATA_PATH = "sections/community/popular_issues.md"

GITHUB_ACCESS_TOKEN = os.getenv(
    "GH_ACCESS_TOKEN"
)  # public_repo and read:org scopes are required


def fetch_issues_from_github_repository(
    repository: str, auth_token: Union[str, None] = None
) -> pd.DataFrame:
    if auth_token is None:
        return pd.DataFrame(
            {
                "Issue": [],
                "State": [],
                "Created at": [],
                "Closed at": [],
                "Last update": [],
                "Labels": [],
                "Milestone": [],
                "Reactions": [],
                "Comments": [],
                "URL": [],
                "Repository": [],
                "Author": [],
            }
        )

    headers = {
        "Authorization": f"token {auth_token}",
        "Accept": "application/vnd.github.v3+json",
    }
    issues_data = []

    print(f"Fetching issues from '{repository}'...")
    with requests.Session() as session:
        session.headers.update(headers)

        owner, repo_name = repository.split("/")
        issues_url = (
            f"https://api.github.com/repos/{owner}/{repo_name}/issues?state=all"
        )

        while issues_url:
            response = session.get(issues_url)
            issues = response.json()

            for issue in issues:
                issues_data.append(
                    {
                        "Issue": f"{issue['number']} - {issue['title']}",
                        "State": issue["state"],
                        "Created at": issue["created_at"],
                        "Closed at": issue.get("closed_at", None),
                        "Last update": issue["updated_at"],
                        "Labels": [label["name"] for label in issue["labels"]],
                        "Milestone": (issue.get("milestone") or {}).get("title"),
                        "Reactions": issue["reactions"]["total_count"],
                        "Comments": issue["comments"],
                        "URL": issue["html_url"],
                        "Repository": repo_name,
                        "Author": issue["user"]["login"],
                    }
                )

            issues_url = response.links.get("next", {}).get("url", None)

    return pd.DataFrame(issues_data)


def get_org_members(auth_token: Union[str, None] = None) -> List[str]:
    if auth_token is None:
        return []

    headers = {
        "Authorization": f"token {auth_token}",
        "Accept": "application/vnd.github.v3+json",
    }
    members_list = []

    members_url = "https://api.github.com/orgs/argilla-io/members"

    while members_url:
        response = requests.get(members_url, headers=headers)
        members = response.json()

        for member in members:
            members_list.append(member["login"])

        members_list.extend(["pre-commit-ci[bot]"])

        members_url = response.links.get("next", {}).get("url", None)

    return members_list


with mkdocs_gen_files.open(DATA_PATH, "w") as f:
    df = fetch_issues_from_github_repository(REPOSITORY, GITHUB_ACCESS_TOKEN)

    open_issues = df.loc[df["State"] == "open"]
    engagement_df = (
        open_issues[["URL", "Issue", "Repository", "Reactions", "Comments"]]
        .sort_values(by=["Reactions", "Comments"], ascending=False)
        .head(10)
        .reset_index()
    )

    members = get_org_members(GITHUB_ACCESS_TOKEN)
    community_issues = df.loc[~df["Author"].isin(members)]
    community_issues_df = (
        community_issues[
            ["URL", "Issue", "Repository", "Created at", "Author", "State"]
        ]
        .sort_values(by=["Created at"], ascending=False)
        .head(10)
        .reset_index()
    )

    planned_issues = df.loc[df["Milestone"].notna()]
    planned_issues_df = (
        planned_issues[
            ["URL", "Issue", "Repository", "Created at", "Milestone", "State"]
        ]
        .sort_values(by=["Milestone"], ascending=False)
        .head(10)
        .reset_index()
    )

    f.write('=== "Most engaging open issues"\n\n')
    f.write("    | Rank | Issue | Reactions | Comments |\n")
    f.write("    |------|-------|:---------:|:--------:|\n")
    for ix, row in engagement_df.iterrows():
        f.write(
            f"    | {ix+1} | [{row['Issue']}]({row['URL']}) | 👍 {row['Reactions']} | 💬 {row['Comments']} |\n"
        )

    f.write('\n=== "Latest issues open by the community"\n\n')
    f.write("    | Rank | Issue | Author |\n")
    f.write("    |------|-------|:------:|\n")
    for ix, row in community_issues_df.iterrows():
        state = "🟢" if row["State"] == "open" else "🟣"
        f.write(
            f"    | {ix+1} | {state} [{row['Issue']}]({row['URL']}) | by **{row['Author']}** |\n"
        )

    f.write('\n=== "Planned issues for upcoming releases"\n\n')
    f.write("    | Rank | Issue | Milestone |\n")
    f.write("    |------|-------|:------:|\n")
    for ix, row in planned_issues_df.iterrows():
        state = "🟢" if row["State"] == "open" else "🟣"
        f.write(
            f"    | {ix+1} | {state} [{row['Issue']}]({row['URL']}) | **{row['Milestone']}** |\n"
        )

    today = datetime.today().date()
    f.write(f"\nLast update: {today}\n")
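Review note: both fetch loops in this script page through GitHub results by following the `next` relation that `requests` exposes on `response.links`, and stop when it is absent. The sketch below isolates that loop so it can be run offline. `FakeResponse`, `PAGES`, and `iter_items` are hypothetical stand-ins written for this illustration; they are not part of the commit or of `requests`.

```python
from typing import Any, Dict, Iterator, List, Optional


class FakeResponse:
    """Stand-in for `requests.Response`, exposing only `.json()` and `.links`."""

    def __init__(self, items: List[Dict[str, Any]], next_url: Optional[str]) -> None:
        self._items = items
        # requests parses the HTTP Link header into a dict keyed by relation.
        self.links = {"next": {"url": next_url}} if next_url else {}

    def json(self) -> List[Dict[str, Any]]:
        return self._items


# Two hypothetical pages of results, keyed by the URL that serves them.
PAGES = {
    "page-1": FakeResponse([{"number": 1}, {"number": 2}], next_url="page-2"),
    "page-2": FakeResponse([{"number": 3}], next_url=None),
}


def iter_items(first_url: str) -> Iterator[Dict[str, Any]]:
    """Yield items from every page, following `next` links until none is left."""
    url: Optional[str] = first_url
    while url:
        response = PAGES[url]  # the committed script calls session.get(url) here
        yield from response.json()
        url = response.links.get("next", {}).get("url")


numbers = [item["number"] for item in iter_items("page-1")]
print(numbers)  # [1, 2, 3]
```

The chained `.get("next", {}).get("url")` returns `None` on the last page, which ends the `while` loop without a separate page counter.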
44 changes: 44 additions & 0 deletions docs/sections/community/index.md
@@ -0,0 +1,44 @@
---
hide:
- toc
- footer
---

We are an open-source community-driven project not only focused on building a great product but also on building a great community, where you can get support, share your experiences, and contribute to the project! We would love to hear from you and help you get started with distilabel.

<div class="grid cards" markdown>

-   __Slack__

    ---

    In our Slack you can get direct support from the community.

    [:octicons-arrow-right-24: Slack ↗](https://join.slack.com/t/rubrixworkspace/shared_invite/zt-whigkyjn-a3IUJLD7gDbTZ0rKlvcJ5g)

-   __Community Meetup__

    ---

    We host bi-weekly community meetups where you can listen in or present your work.

    [:octicons-arrow-right-24: Community Meetup ↗](https://lu.ma/argilla-event-calendar)

-   __Changelog__

    ---

    The changelog is where you can find the latest updates and changes to the distilabel project.

    [:octicons-arrow-right-24: Changelog ↗](https://github.com/argilla-io/distilabel/releases)

-   __Roadmap__

    ---

    We love to discuss our plans with the community. Feel encouraged to participate in our roadmap discussions.

    [:octicons-arrow-right-24: Roadmap ↗](https://github.com/orgs/argilla-io/projects/15)

</div>