TiktokenCpp

English | [简体中文]

tiktoken 是 openai 的开源词法分析器，TiktokenCpp 是 C++ 移植版本。TiktokenCpp 采用现代 C++ 语言特性，只为提供与 tiktoken 的 Python 接口几乎一致的接口函数。

✨ 使用

#include "tiktoken.h"

using namespace TiktokenCpp;

//auto encoding = EncodingForModel("GPT-4"); //same as next line
auto encoding = GetEncoding("cl100k_base");

//ordinary text 
std::string text = "tiktoken is great!";
auto tokens = encoding->EncodeOrdinary(text);
PrintTokens(std::cout, tokens);// [83, 1609, 5963, 374, 2294, 0]
std::string dec_text = encoding->Decode(tokens);
assert(dec_text == text);

//non-English text
std::string text = "Hi，试一下中文字符！";
auto tokens = encoding->EncodeOrdinary(text);
PrintTokens(std::cout, tokens);// [13347, 3922, 42421, 15120, 17297, 16325, 17161, 49491, 6447]
std::string dec_text = encoding->Decode(tokens);
assert(dec_text == text);

//special text 1
std::string text = "hello <|endoftext|>";
//allowedSpecial: "all" , disallowedSpecial: "all" (default)
auto tokens = encoding->Encode(text, "all"); 
PrintTokens(std::cout, enc); //[15339, 220, 100257]
auto dec_text = encoding->Decode(tokens);
assert(dec_text == text);

//special text 2
std::string text = "<|endoftext|>";
//allowedSpecial: "<|endoftext|>" , disallowedSpecial: "all" (default)
auto tokens = encoding->Encode(text, {"<|endoftext|>"});
PrintTokens(std::cout, enc); //[100257]
auto dec_text = encoding->Decode(tokens);
assert(dec_text == text);

//special text 3
std::string text = "<|endoftext|>";
//allowedSpecial: null (default), disallowedSpecial: null
auto tokens = encoding->Encode(text, {}, {});
PrintTokens(std::cout, enc); //[27, 91, 8862, 728, 428, 91, 29]
auto dec_text = encoding->Decode(tokens);
assert(dec_text == text);

✨ 下载 encoding 文件

auto encodingnames = ListEncodingNames();
for (const auto& name : encodingnames)
{
    if (!IsLocalEncodingCacheExisted(name))
    {
        std::cout << "Downloading encoding file for " << name << "...";
        bool rtn = DownloadEncoding(name); //DownloadEncoding(name, proxy);
        std::cout << (rtn ? "successfully!" : "failed!") << std::endl;
    }
}

✨ Build

依赖库准备

TiktokenCpp 依赖这些开源的库：

Boost
curl
pcre2
iconv
cppcodec（已经附带在 3rdparty 目录中）

For Linux

git clone https://github.com/inte2000/llm_cpp.git
cd llm_cpp
mkdir build
cd build/
cmake ..
make

如果编译没有问题，可以安装 TiktokenCpp 库：

sudo make install

默认安装位置是 '\usr\local'，可以运行 token_test 测试一下：

cd /usr/local/bin
token_test

输出结果预期如下：

Current encoding cache location: /usr/local/bin/cache
Encoding: "hello world"
[15339, 1917]
Encoding: "goodbye world"
[19045, 29474, 1917]
Encoding: "tiktoken is great!"
[83, 1609, 5963, 374, 2294, 0]
Encoding: "Hi，试一下中文字符！"
[13347, 3922, 42421, 15120, 17297, 16325, 17161, 49491, 6447]
Encoding: "Hello, 😀"
[9906, 11, 91416]
Encoding: ""
[]
Encoding: "hello <|endoftext|>", with allowedSpecial: all , disallowedSpecial: all
[15339, 220, 100257]
Encoding: "<|endoftext|>", with allowedSpecial: <|endoftext|> , disallowedSpecial: all
[100257]
Encoding: "<|endoftext|>", with allowedSpecial: null, disallowedSpecial: null
[27, 91, 8862, 728, 428, 91, 29]
Symbols test for: "tiktoken is great!"
[83, 1609, 5963, 374, 2294, 0]
[t, ik, token,  is,  great, !]

For Windows

使用 git clone 或将源代码包解压缩，比如："d:\code\llm_cpp"。然后根据本地依赖库的编译和安装位置，修改 "d:\code\llm_cpp\vsprj" 目录下的 'boost_1_83_0_static_runtime.props'、libcurl-8.2.1_static.props、libIconv-1.16_static.props 和 pcre2-10.42_static.props 四个属性表文件。以 pcre2-10.42_static.props 属性表文件为例，如果你本地编译的库目录结构与下面的目录结构一致：

$(PCRE2_ROOT)
--lib
  --Debug
    --x86
    --x64
  --Release
    --x86
    --x64

则只需要修改 PCRE2_ROOT 宏，将其替换成 Pcre2 安装的根目录即可：

<PropertyGroup Label="UserMacros">
  <PCRE2_ROOT>D:\development\pcre2-10.42</PCRE2_ROOT>
</PropertyGroup>

否则的话，就需要同时修改 AdditionalLibraryDirectories 属性，使其与本地编译安装的目录结构一致。

使用 Visual Studio 2019 打开 "d:\code\llm_cpp\vsprj\llm_cpp.sln"，依次编译 tiktoken 和 token_test 项目。

在 Windows 平台编译和运行 token_test 需要注意源代码文件的编码问题，TiktokenCpp 使用 UTF-8 编码作为输入文字的编码。在 Windows 平台编译和运行 token_test 或自己的测试项目，需要注意源代码文件使用 UTF-8 编码，否则，就需要进行平台本地编码与 UTF-8 编码的转换。Utf8String.h 提供了 UTF8StrFromLocalMBCS 和 LocalMBCSFromUTF8Str 两个函数，会自动判断平台的代码集，提供相应的转换功能。

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README_zh.md

README_zh.md

TiktokenCpp

✨ 使用

✨ 下载 encoding 文件

✨ Build

依赖库准备

For Linux

For Windows

Files

README_zh.md

Latest commit

History

README_zh.md

File metadata and controls

TiktokenCpp

✨ 使用

✨ 下载 encoding 文件

✨ Build

依赖库准备

For Linux

For Windows