Making Models Smarter: GPT-4 and Scikit-Learn

An introduction to the seamless integration of ChatGPT-4 and Scikit-Learn

Image by Author with @MidJourney

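ChatGPT makes it convenient and efficient to build text classification models. Scikit-learn is a widely used Python library for building machine learning models. Combining the two through Scikit-LLM lets us build more powerful models without having to call the OpenAI API by hand.

Classification and labeling are among the most common natural language processing (NLP) tasks. They usually require collecting labeled data, training a model, deploying an endpoint, and setting up inference. This can be time-consuming and expensive, and it often takes several models to cover the different tasks.

Large language models (LLMs) such as ChatGPT offer a new way to approach these NLP tasks. Instead of training and deploying one model per task, we can use prompt engineering to handle a wide range of NLP tasks with a single model.

In this article we walk through building multi-class and multi-label text classification models with ChatGPT. We will introduce a useful new library, scikit-LLM, a scikit-learn wrapper around the OpenAI API that lets us build powerful models the same way we build regular scikit-learn models. Let's get started!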
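To make the idea concrete, here is a minimal sketch of how a single model could serve several tasks purely by changing the prompt. The template wording and the `build_prompt` helper are illustrative assumptions, not prompts used by ChatGPT or by any library.

```python
# Hypothetical prompt templates: one model, several NLP tasks.
# The wording below is an assumption made for illustration only.
TEMPLATES = {
    "classify": (
        "Classify the text into exactly one of these labels: {labels}.\n"
        "Text: {text}\n"
        "Label:"
    ),
    "summarize": (
        "Summarize the text in at most {max_words} words.\n"
        "Text: {text}\n"
        "Summary:"
    ),
}


def build_prompt(task, text, **options):
    """Fill the template for `task` with the input text and task options."""
    if task not in TEMPLATES:
        raise ValueError(f"Unknown task: {task!r}")
    if "labels" in options:
        # Render the candidate labels as a comma-separated list
        options["labels"] = ", ".join(options["labels"])
    return TEMPLATES[task].format(text=text, **options)


classification_prompt = build_prompt(
    "classify", "The movie was fantastic.", labels=["positive", "negative"]
)
summary_prompt = build_prompt(
    "summarize", "A long article about text classification.", max_words=15
)
print(classification_prompt)
print(summary_prompt)
```

The same text-in, text-out model would receive either prompt; only the instructions around the input change.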

Setup

Let's start by installing the scikit-LLM package with pip, poetry, or your preferred package manager:

pip install scikit-llm

Getting an OpenAI API key

To make full use of scikit-LLM, we need to provide an OpenAI API key. Let's import the configuration module and specify our key:

# Import SKLLMConfig to configure OpenAI API (key and organization)
from skllm.config import SKLLMConfig

# Set your OpenAI API key
SKLLMConfig.set_openai_key("<YOUR_KEY>")

# Set your OpenAI organization (optional)
SKLLMConfig.set_openai_org("<YOUR_ORGANIZATION>")

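If you want to follow along, keep the following in mind:

The free OpenAI trial is not enough, because we need more than three requests per minute. Switch to a pay-as-you-go plan first.
Make sure you pass your organization ID, not its name, to SKLLMConfig.set_openai_org. You can find the ID here: https://platform.openai.com/account/org-settings.

We're all set. Let's start building models!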
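Hard-coding a key in a script makes it easy to leak. Below is a minimal sketch of reading the credentials from environment variables instead. The variable names `OPENAI_API_KEY` and `OPENAI_ORG_ID` and the `load_openai_credentials` helper are assumptions chosen for this example, not something scikit-LLM requires.

```python
import os


def load_openai_credentials(env=None):
    """Return (api_key, org_id) read from environment variables.

    The variable names are an assumption for this sketch; use whatever
    names your own setup defines.
    """
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("Set the OPENAI_API_KEY environment variable first.")
    org_id = env.get("OPENAI_ORG_ID")  # optional, may be None
    return api_key, org_id


# Demonstration with an explicit mapping, so no real key is needed here
key, org = load_openai_credentials(
    {"OPENAI_API_KEY": "sk-example", "OPENAI_ORG_ID": "org-example"}
)
print(key, org)

# With scikit-LLM installed, the values would then go to the calls shown earlier:
# SKLLMConfig.set_openai_key(key)
# if org:
#     SKLLMConfig.set_openai_org(org)
```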

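Zero-shot GPT classifier

Text classification is one of ChatGPT's most impressive capabilities. It can even perform zero-shot classification, which needs no task-specific training and instead relies on descriptive labels to classify the text. This is done with the ZeroShotGPTClassifier class: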

# Importing the ZeroShotGPTClassifier module and the classification dataset
from skllm import ZeroShotGPTClassifier
from skllm.datasets import get_classification_dataset

# Load the demo classification dataset bundled with scikit-LLM
X, y = get_classification_dataset()

# Define the model
clf = ZeroShotGPTClassifier(openai_model="gpt-3.5-turbo")

# Fit the data
clf.fit(X, y)

# Make predictions
labels = clf.predict(X)

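Scikit-LLM takes care of the API-related details and makes sure you get usable labels. It checks that each response contains a valid label; if it does not, scikit-LLM picks a label at random, weighting the choice by how often each label occurs in the training data.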
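The fallback behaviour described above can be pictured with a short sketch. This is not scikit-LLM's actual implementation. The `validate_label` helper and its arguments are hypothetical and only illustrate the idea of keeping a valid label and otherwise drawing one weighted by training frequency.

```python
import random
from collections import Counter


def validate_label(response, training_labels, rng=random):
    """Return `response` if it is a known label, else a frequency-weighted random label."""
    counts = Counter(training_labels)
    cleaned = response.strip()
    if cleaned in counts:
        return cleaned
    labels = list(counts)
    weights = [counts[label] for label in labels]
    # choices() draws proportionally to the given weights
    return rng.choices(labels, weights=weights, k=1)[0]


seen_labels = ["positive", "positive", "positive", "negative", "neutral"]
print(validate_label(" negative ", seen_labels))     # a valid label is kept
print(validate_label("I am not sure", seen_labels))  # an invalid one is replaced
```

With these made-up training labels, "positive" is three times as likely to be drawn as either of the other two.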

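Multi-label zero-shot text classification

In the previous section we saw zero-shot classification; the same is possible with a multi-label approach. Instead of assigning a single label, Scikit-LLM can assign several of the existing labels to one text, which gives a more nuanced description of it.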

# Importing the MultiLabelZeroShotGPTClassifier module and the classification dataset
from skllm import MultiLabelZeroShotGPTClassifier
from skllm.datasets import get_multilabel_classification_dataset

# Load the demo multi-label classification dataset bundled with scikit-LLM
X, y = get_multilabel_classification_dataset()

# Define the model
clf = MultiLabelZeroShotGPTClassifier(max_labels=3)

# Fit the model
clf.fit(X, y)

# Make predictions
labels = clf.predict(X)

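In code, the only difference between zero-shot and multi-label zero-shot classification is the class you use. For multi-label classification we use the MultiLabelZeroShotGPTClassifier class and specify max_labels; in this example we cap it at 3 labels per sample.

Text vectorization

Another NLP task is converting text into a numerical representation that machines can process and analyze further. This process is called text vectorization, and scikit-LLM supports it as well. Here is an example using GPTVectorizer: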
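One way to picture what `max_labels` does is as a cap applied to whatever labels the model returns. The sketch below is an assumption about that behaviour, written in plain Python. `cap_labels` is a hypothetical helper, not part of scikit-LLM, and the label names are made up.

```python
def cap_labels(predicted, known_labels, max_labels=3):
    """Keep only known labels, drop duplicates, and return at most `max_labels`."""
    kept = []
    for label in predicted:
        if len(kept) >= max_labels:
            break
        if label in known_labels and label not in kept:
            kept.append(label)
    return kept


known = {"quality", "price", "delivery", "service"}
raw = ["price", "quality", "price", "packaging", "service", "delivery"]
print(cap_labels(raw, known))  # ['price', 'quality', 'service']
```

The duplicate "price" and the unknown "packaging" are dropped, and "delivery" is cut off because three labels were already collected.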

# Importing the GPTVectorizer class from the skllm.preprocessing module
from skllm.preprocessing import GPTVectorizer

# Creating an instance of the GPTVectorizer class 
# and assigning it to the variable 'model'
model = GPTVectorizer()  

# Transforming the text data
vectors = model.fit_transform(X)

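As with any regular scikit-learn transformer, we call the fit_transform method to fit the model and use it to transform the text.

Let's take it up a notch!

The output of GPTVectorizer can be used in a machine learning pipeline. In this example we use it to prepare data for an XGBoost classifier, so the pipeline first vectorizes the text and then classifies it: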
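Once texts are vectors, they can be compared numerically. Here is a minimal sketch of cosine similarity, a common way to compare embedding vectors. The three short vectors are made-up stand-ins, since real `GPTVectorizer` output requires an API call and has far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length.")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("Cosine similarity is undefined for zero vectors.")
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for the embeddings of three texts
review_a = [0.9, 0.1, 0.0]
review_b = [0.8, 0.2, 0.1]
recipe = [0.0, 0.1, 0.9]

print(cosine_similarity(review_a, review_b))  # close to 1: similar direction
print(cosine_similarity(review_a, recipe))    # close to 0: nearly orthogonal
```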

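# Importing the necessary modules and classes
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from skllm.datasets import get_classification_dataset
from skllm.preprocessing import GPTVectorizer
from xgboost import XGBClassifier

# Reload the single-label dataset and split it into train and test sets
# (the 80/20 stratified split is an illustrative assumption, not from the original)
X, y = get_classification_dataset()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)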

# Creating an instance of LabelEncoder class
le = LabelEncoder()

# Encoding the training labels 'y_train' using LabelEncoder
y_train_encoded = le.fit_transform(y_train)

# Encoding the test labels 'y_test' using LabelEncoder
y_test_encoded = le.transform(y_test)

# Defining the steps of the pipeline as a list of tuples
steps = [('GPT', GPTVectorizer()), ('Clf', XGBClassifier())]

# Creating a pipeline with the defined steps
clf = Pipeline(steps)

# Fitting the pipeline on the training data 'X_train' 
# and the encoded training labels 'y_train_encoded'
clf.fit(X_train, y_train_encoded)

# Predicting the labels for the test data 'X_test' 
# using the trained pipeline
yh = clf.predict(X_test)

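First we vectorize the text, then we classify it with XGBoost. We encode the training labels and fit the pipeline on the training data so that it can predict labels for the test data. Awesome!

Text summarization

Our last example is a very common NLP task: text summarization. ChatGPT is efficient at these common NLP tasks and strong at language-related work in general. Scikit-LLM provides a handy GPTSummarizer module that can be used in two ways: on its own, or as a step in a preprocessing pipeline. Let's see what it can do: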
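Note that the pipeline predicts integer class ids, not the original string labels. With the encoder above, they can be mapped back with `le.inverse_transform(yh)`. The round trip is sketched below in plain Python so that it runs without API access. The label values and predictions are made up, and `fit_classes`, `encode`, `decode`, and `accuracy` are hypothetical helpers that mimic the sorted-classes mapping `LabelEncoder` uses.

```python
def fit_classes(labels):
    """Sorted unique labels; a label's position is its integer id."""
    return sorted(set(labels))


def encode(labels, classes):
    """Map string labels to their integer ids."""
    index = {label: i for i, label in enumerate(classes)}
    return [index[label] for label in labels]


def decode(ids, classes):
    """Map integer ids back to string labels."""
    return [classes[i] for i in ids]


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the truth."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


train_labels = ["positive", "negative", "neutral", "positive"]
test_labels = ["neutral", "positive", "negative"]

classes = fit_classes(train_labels)       # ['negative', 'neutral', 'positive']
test_ids = encode(test_labels, classes)   # [1, 2, 0]
predicted_ids = [1, 2, 2]                 # made-up pipeline output

print(decode(predicted_ids, classes))     # ['neutral', 'positive', 'positive']
print(accuracy(test_ids, predicted_ids))  # 2 of 3 predictions are correct
```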

# Importing the GPTSummarizer class from the skllm.preprocessing module
from skllm.preprocessing import GPTSummarizer

# Importing the get_summarization_dataset function
from skllm.datasets import get_summarization_dataset

# Calling the get_summarization_dataset function to retrieve input data 'X'
X = get_summarization_dataset()

# Creating an instance of the GPTSummarizer
s = GPTSummarizer(openai_model='gpt-3.5-turbo', max_words=15)

# Applying the fit_transform method of the GPTSummarizer 
# instance to the input data 'X'.
# It fits the model to the data and generates the summaries, 
# which are assigned to the variable 'summaries'
summaries = s.fit_transform(X)

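Note that we set the max_words parameter to 15. This acts as a flexible limit on the number of words in each generated summary: it gives an approximate target length but is not strictly enforced, so a summary may exceed the specified limit.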
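If a downstream step needs a hard cap, the soft limit can be enforced after the fact. Here is a minimal sketch, assuming simple whitespace word splitting. `truncate_words` is a hypothetical helper, and the example summary is made up.

```python
def truncate_words(text, max_words=15):
    """Cut `text` to at most `max_words` whitespace-separated words."""
    words = text.split()
    if len(words) <= max_words:
        return text
    # Mark the cut with an ellipsis attached to the last kept word
    return " ".join(words[:max_words]) + "..."


summary = (
    "The product arrived late but the support team resolved the issue quickly "
    "and offered a partial refund to the customer"
)
short = truncate_words(summary, max_words=15)
print(short)
print(len(short.split()))  # 15
```

Applied to the summarizer output, this would be a list comprehension such as `[truncate_words(s) for s in summaries]`.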