python - I keep getting this error when I run the program. I've tried everything I can think of, but it still doesn't work

Tags: python pandas scikit-learn

I'm trying out this NCAA basketball prediction program, but I keep getting this error:

Traceback (most recent call last):
  File "/mnt/chromeos/removable/JACKS JUNK/Chatbot_2/sports_predict.py", line 17, in <module>
    X_train, X_test, y_train, y_test = train_test_split(X, y)
  File "/home/jackmdavis06/.local/lib/python3.5/site-packages/sklearn/model_selection/_split.py", line 2116, in train_test_split
    arrays = indexable(*arrays)
  File "/home/jackmdavis06/.local/lib/python3.5/site-packages/sklearn/utils/validation.py", line 237, in indexable
    check_consistent_length(*result)
  File "/home/jackmdavis06/.local/lib/python3.5/site-packages/sklearn/utils/validation.py", line 212, in check_consistent_length
    " samples: %r" % [int(l) for l in lengths])
ValueError: Found input variables with inconsistent numbers of samples: [2258, 4148]

Here is my code:

import pandas as pd
from sportsreference.ncaab.teams import Teams
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

FIELDS_TO_DROP = ['away_points', 'home_points', 'date', 'location',
                  'losing_abbr', 'losing_name', 'winner', 'winning_abbr',
                  'winning_name', 'home_ranking', 'away_ranking']


teams = Teams()


dataset = pd.read_csv('data.csv')
X = dataset.drop(FIELDS_TO_DROP, 1).dropna().drop_duplicates()
y = dataset[['home_points', 'away_points']].values
X_train, X_test, y_train, y_test = train_test_split(X, y)

parameters = {'bootstrap': False,
                'min_samples_leaf': 3,
                'n_estimators': 50,
                'min_samples_split': 10,
                'max_features': 'sqrt',
                'max_depth': 6}
model = RandomForestRegressor(**parameters)
model.fit(X_train, y_train)
print(model.predict(X_test).astype(int), y_test)

I followed the guide on this site:

https://towardsdatascience.com/predict-college-basketball-scores-in-30-lines-of-python-148f6bd71894

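I had tweaked the code slightly to make it run faster, so I tried running the original code, and only the original code, but I got exactly the same error. Please help! Thanks!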

Best answer

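You dropped the nulls and duplicates from X, but not from y, so the two no longer have the same number of rows. If you print(X.shape[0], len(y)), you will see two different values: 2258 and 4148, the two numbers in the error message.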
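To see the mismatch concretely, here is a minimal sketch on a made-up frame. The `pace` column and all values are assumptions for illustration, not taken from the real data.csv. Cleaning only X removes rows, while y still contains every row:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for data.csv
dataset = pd.DataFrame({
    "home_points": [70, 65, 80, 80, 90],
    "away_points": [60, 75, 72, 72, 85],
    "pace":        [68.0, np.nan, 71.5, 71.5, 66.2],
})

# Same pattern as in the question: only X is cleaned
X = dataset.drop(columns=["home_points", "away_points"]).dropna().drop_duplicates()
y = dataset[["home_points", "away_points"]].values

# One row is lost to the NaN and one to the duplicate, so X has 3 rows and y has 5
n_feature_rows, n_target_rows = X.shape[0], len(y)
print(n_feature_rows, n_target_rows)
```

Deduplicating on the feature columns alone can also discard rows whose targets differ, which is another reason to clean features and targets together.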

You should do this instead:


#...
dataset = pd.read_csv('data.csv')

# Drop nulls and duplicates from the whole frame so features and targets stay aligned.
# FIELDS_TO_KEEP lists every column used in the analysis, features and targets alike,
# e.g. FIELDS_TO_KEEP = ['a', 'b', ...]
dataset = dataset[FIELDS_TO_KEEP].dropna().drop_duplicates()

# Split the cleaned frame into features X and targets y
X = dataset[FIELDS_THAT_ARE_FEATURES]
y = dataset[['home_points', 'away_points']]

# ...
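For reference, here is a runnable sketch of the same idea from end to end. It assumes synthetic data in place of data.csv; the feature column names (`pace`, `home_offensive_rating`, `away_offensive_rating`) and the `TARGETS`, `FEATURES`, and `clean` names are hypothetical and chosen only for illustration. The model parameters are a subset of those in the question.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data standing in for data.csv
rng = np.random.default_rng(0)
n = 200
dataset = pd.DataFrame({
    "pace": rng.normal(70, 5, n),
    "home_offensive_rating": rng.normal(105, 8, n),
    "away_offensive_rating": rng.normal(100, 8, n),
    "home_points": rng.integers(50, 100, n),
    "away_points": rng.integers(50, 100, n),
})
dataset.loc[::10, "pace"] = np.nan  # inject missing values into every 10th row

TARGETS = ["home_points", "away_points"]
FEATURES = [c for c in dataset.columns if c not in TARGETS]

# Clean features and targets together, then split into X and y
clean = dataset[FEATURES + TARGETS].dropna().drop_duplicates()
X = clean[FEATURES]
y = clean[TARGETS]

# X and y now have the same number of rows, so the split no longer raises
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(n_estimators=50, max_depth=6, random_state=0)
model.fit(X_train, y_train)
predictions = model.predict(X_test).astype(int)
print(predictions.shape, y_test.shape)
```

If you prefer to keep the question's structure and clean X on its own, an alternative is to select the matching target rows afterwards with `dataset.loc[X.index, ['home_points', 'away_points']]`.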

Source: Stack Overflow, https://stackoverflow.com/questions/59446241/
