python - Linear regression analysis with categorical features

Tags: python regression linear-regression

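Regression algorithms work well on data represented as numbers. It is quite clear how to run a regression on numeric data and predict an output. However, I need to run a regression analysis on data that contains categorical features. I have a csv file with two columns, install-id and page-name, both of object type. I need to take install-id as the input and predict the page name as the output. Below is my code. Please help me.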

import pandas as pd
from sklearn import linear_model
from sklearn.model_selection import train_test_split

# load the csv; both columns are of object (string) type
data = pd.read_csv("/Users/kashifjilani/Downloads/csv/newjsoncontent.csv")
X = data["install-id"]
Y = data["endPoint"]

# one-hot encode the categorical input column
X = pd.get_dummies(data=X, drop_first=True)

# hold out 20% of the rows for testing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = .20, random_state = 40)

# fit a linear regression and predict on the held-out rows
regr = linear_model.LinearRegression()
regr.fit(X_train, Y_train)
predicted = regr.predict(X_test)

Best answer

For demonstration, suppose you have this data frame, where IQ and Gender are the input features. The target variable is Test Score.

|   Student |   IQ | Gender   |   Test Score |
|----------:|-----:|:---------|-------------:|
|         1 |  125 | Male     |           93 |
|         2 |  120 | Female   |           86 |
|         3 |  115 | Male     |           96 |
|         4 |  110 | Female   |           81 |
|         5 |  105 | Male     |           92 |
|         6 |  100 | Female   |           75 |
|         7 |   95 | Male     |           84 |
|         8 |   90 | Female   |           77 |
|         9 |   85 | Male     |           73 |
|        10 |   80 | Female   |           74 |

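Here, IQ is numerical and Gender is a categorical feature. In the preprocessing step, we apply a simple imputer to the numerical feature and a one-hot encoder to the categorical feature. You can use sklearn's Pipeline and ColumnTransformer for this. After that, you can easily train and predict with the model of your choice.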
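As a small illustration of that preprocessing step on its own, the sketch below runs a ColumnTransformer over three rows and prints the resulting numeric matrix. The names `X_demo` and `preprocess_demo` are made up for this example, and the missing IQ value is an assumption added only to show what the imputer does; the data in this answer has no missing values.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical three-row sample; the second IQ is deliberately missing.
X_demo = pd.DataFrame(
    {"Gender": ["Male", "Female", "Male"], "IQ": [125.0, None, 115.0]}
)

# One-hot encode Gender, fill missing IQ values with the column mean.
preprocess_demo = ColumnTransformer(
    [
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Gender"]),
        ("num", SimpleImputer(strategy="mean"), ["IQ"]),
    ]
)

encoded = preprocess_demo.fit_transform(X_demo)
# The result may be sparse depending on its density; make it dense for printing.
if hasattr(encoded, "toarray"):
    encoded = encoded.toarray()

# Columns are: Gender=Female, Gender=Male, IQ
print(encoded)
# [[  0.   1. 125.]
#  [  1.   0. 120.]
#  [  0.   1. 115.]]
```

Each category becomes its own 0/1 column, so the linear model only ever sees numbers. The missing IQ is replaced by 120, the mean of the two known values.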

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn import linear_model

# defining the data
d = {
    "Student": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "IQ": [125, 120, 115, 110, 105, 100, 95, 90, 85, 80,],
    "Gender": [
        "Male",
        "Female",
        "Male",
        "Female",
        "Male",
        "Female",
        "Male",
        "Female",
        "Male",
        "Female",
    ],
    "Test Score": [93, 86, 96, 81, 92, 75, 84, 77, 73, 74],
}

# converting into pandas dataframe
df = pd.DataFrame(d)

# setting the student id as index to keep track
df = df.set_index("Student")

# column transformation
categorical_columns = ["Gender"]
numerical_columns = ["IQ"]

# determine X
X = df[categorical_columns + numerical_columns]
y = df["Test Score"]

# train test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, test_size=0.3
)

# categorical pipeline
categorical_pipe = Pipeline([("onehot", OneHotEncoder(handle_unknown="ignore"))])

# numerical pipeline
numerical_pipe = Pipeline([("imputer", SimpleImputer(strategy="mean")),])

# combining both pipelines into one column transformer
preprocessing = ColumnTransformer(
    [
        ("cat", categorical_pipe, categorical_columns),
        ("num", numerical_pipe, numerical_columns),
    ]
)


# full pipeline: preprocessing followed by the linear regression model
model = Pipeline(
    [("preprocess", preprocessing), ("regressor", linear_model.LinearRegression())]
)

# train
model.fit(X_train, y_train)

# predict
predict = model.predict(X_test)

This outputs:

>> array([84.48275862, 84.55172414, 79.13793103])
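To judge how good such predictions are, they can be compared with the held-out test scores. The following is a minimal sketch rather than part of the original answer: it rebuilds the same toy data and pipeline so that it runs on its own, and the choice of mean absolute error as the metric is an assumption made for illustration. The names `model`, `comparison` and `mae` are likewise illustrative.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Same toy data as in the answer above.
df = pd.DataFrame(
    {
        "IQ": [125, 120, 115, 110, 105, 100, 95, 90, 85, 80],
        "Gender": ["Male", "Female"] * 5,
        "Test Score": [93, 86, 96, 81, 92, 75, 84, 77, 73, 74],
    }
)
X = df[["Gender", "IQ"]]
y = df["Test Score"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, test_size=0.3
)

# One-hot encode Gender, impute IQ, then fit a linear regression.
preprocessing = ColumnTransformer(
    [
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Gender"]),
        ("num", SimpleImputer(strategy="mean"), ["IQ"]),
    ]
)
model = Pipeline(
    [("preprocess", preprocessing), ("regressor", LinearRegression())]
)
model.fit(X_train, y_train)

# Compare predictions with the true scores of the held-out rows.
predicted = model.predict(X_test)
mae = mean_absolute_error(y_test, predicted)
comparison = pd.DataFrame(
    {"actual": y_test.to_numpy(), "predicted": predicted}, index=y_test.index
)
print(comparison)
print(f"Mean absolute error: {mae:.2f}")
```

With only ten rows the error estimate is very noisy, so this is mainly useful as a template for a larger dataset.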

Regarding python - linear regression analysis with categorical features, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/60181397/
