Predicting Whether a Person Is Married with TensorFlow
February 11, 2025
Building on the earlier tutorials, we can write a TensorFlow program that predicts whether a person is married from a set of input features.
Note that this program does not use the logistic-regression algorithm from classical machine learning to do the classification; instead, it builds a small custom deep neural network (DNN), so what it actually exercises is deep learning.
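For comparison, here is what a minimal logistic-regression baseline might look like. This is only a sketch: it assumes the same married.csv schema used by the program below, and one-hot encodes the string columns with pandas.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv('./married.csv')
X = pd.get_dummies(df.drop(columns=['married']))  # one-hot encode sex/education
y = df['married']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"Logistic-regression baseline accuracy: {clf.score(X_test, y_test):.4f}")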
The training data has 100,000 rows; please download it from this link.
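The program below expects the CSV to have the following columns. The header is taken from the code; the sample row reuses the prediction example further down, and its married label is purely illustrative.
age,sex,height,education,has_car,has_house,married
30,female,165,doctor,0,0,0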
The TensorFlow program is shown below. The tricky part is the feature engineering: converting string and numeric columns into the corresponding TensorFlow tensors. Once the features are encoded, building, training, evaluating, and querying the model takes only a few lines.
import os
# TensorFlow is the only backend that supports string inputs.
os.environ["KERAS_BACKEND"] = "tensorflow"
import tensorflow as tf
import pandas as pd
import keras
from keras import layers
dataframe = pd.read_csv('./married.csv')
val_dataframe = dataframe.sample(frac=0.2, random_state=1337)
train_dataframe = dataframe.drop(val_dataframe.index)
print(
    f"Using {len(train_dataframe)} samples for training "
    f"and {len(val_dataframe)} for validation"
)
def dataframe_to_dataset(dataframe):
    dataframe = dataframe.copy()
    labels = dataframe.pop("married")
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
    ds = ds.shuffle(buffer_size=len(dataframe))
    return ds
train_ds = dataframe_to_dataset(train_dataframe)
val_ds = dataframe_to_dataset(val_dataframe)
# for x, y in train_ds.take(1):
#     print("Input:", x)
#     print("Target:", y)
train_ds = train_ds.batch(32)
val_ds = val_ds.batch(32)
def encode_numerical_feature(feature, name, dataset):
    # Create a Normalization layer for our feature
    normalizer = layers.Normalization()
    # Prepare a Dataset that only yields our feature
    feature_ds = dataset.map(lambda x, y: x[name])
    feature_ds = feature_ds.map(lambda x: tf.expand_dims(x, -1))
    # Learn the statistics of the data
    normalizer.adapt(feature_ds)
    # Normalize the input feature
    encoded_feature = normalizer(feature)
    return encoded_feature
def encode_categorical_feature(feature, name, dataset, is_string):
    lookup_class = layers.StringLookup if is_string else layers.IntegerLookup
    # Create a lookup layer which will turn strings into integer indices
    lookup = lookup_class(output_mode="binary")
    # Prepare a Dataset that only yields our feature
    feature_ds = dataset.map(lambda x, y: x[name])
    feature_ds = feature_ds.map(lambda x: tf.expand_dims(x, -1))
    # Learn the set of possible string values and assign them a fixed integer index
    lookup.adapt(feature_ds)
    # Turn the string input into integer indices
    encoded_feature = lookup(feature)
    return encoded_feature
# Categorical features encoded as integers
has_car = keras.Input(shape=(1,), name="has_car", dtype="int64")
has_house = keras.Input(shape=(1,), name="has_house", dtype="int64")
# Categorical features encoded as strings
sex = keras.Input(shape=(1,), name="sex", dtype="string")
education = keras.Input(shape=(1,), name="education", dtype="string")
# Numerical features
age = keras.Input(shape=(1,), name="age")
height = keras.Input(shape=(1,), name="height")
all_inputs = [
    age,
    sex,
    height,
    education,
    has_car,
    has_house,
]
# Integer categorical features
has_car_encoded = encode_categorical_feature(has_car, "has_car", train_ds, False)
has_house_encoded = encode_categorical_feature(has_house, "has_house", train_ds, False)
# String categorical features
sex_encoded = encode_categorical_feature(sex, "sex", train_ds, True)
education_encoded = encode_categorical_feature(education, "education", train_ds, True)
# Numerical features
age_encoded = encode_numerical_feature(age, "age", train_ds)
height_encoded = encode_numerical_feature(height, "height", train_ds)
all_features = layers.concatenate(
    [
        age_encoded,
        sex_encoded,
        height_encoded,
        education_encoded,
        has_car_encoded,
        has_house_encoded,
    ]
)
# build the model
x = layers.Dense(16, activation="relu")(all_features)
x = layers.Dropout(0.2)(x)
output = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(all_inputs, output)
model.compile("adam", "binary_crossentropy", metrics=["accuracy"])
# `rankdir='LR'` is to make the graph horizontal.
keras.utils.plot_model(model, show_shapes=True, rankdir="LR")
# training
model.fit(train_ds, epochs=10, validation_data=val_ds)
# prediction
sample = {
    "age": 30,
    "sex": "female",
    "height": 165,
    "education": "doctor",
    "has_car": 0,
    "has_house": 0,
}
input_dict = {name: tf.convert_to_tensor([value]) for name, value in sample.items()}
predictions = model.predict(input_dict)
print(
    f"This person has a {100 * predictions[0][0]:.1f} "
    "percent probability of being married."
)
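The program above never calls evaluate explicitly. If you want a standalone evaluation on the validation split, one extra line suffices (a sketch, reusing the model and val_ds defined above):
val_loss, val_acc = model.evaluate(val_ds)
print(f"Validation accuracy: {val_acc:.4f}")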
The training process is shown below. Accuracy rises steadily while the loss falls steadily, which indicates that the model is converging properly. If accuracy oscillates instead, the model is failing to converge, and the cause is most often a problem with the training data.
Epoch 1/10
2500/2500 ━━━━━━━━━━━━━━━━━━━━ 9s 3ms/step - accuracy: 0.7725 - loss: 0.4526 - val_accuracy: 0.8997 - val_loss: 0.2466
Epoch 2/10
2500/2500 ━━━━━━━━━━━━━━━━━━━━ 11s 3ms/step - accuracy: 0.8896 - loss: 0.2586 - val_accuracy: 0.9204 - val_loss: 0.2023
Epoch 3/10
2500/2500 ━━━━━━━━━━━━━━━━━━━━ 7s 3ms/step - accuracy: 0.9025 - loss: 0.2275 - val_accuracy: 0.9327 - val_loss: 0.1758
Epoch 4/10
2500/2500 ━━━━━━━━━━━━━━━━━━━━ 11s 3ms/step - accuracy: 0.9114 - loss: 0.2087 - val_accuracy: 0.9379 - val_loss: 0.1595
Epoch 5/10
2500/2500 ━━━━━━━━━━━━━━━━━━━━ 9s 3ms/step - accuracy: 0.9144 - loss: 0.1984 - val_accuracy: 0.9455 - val_loss: 0.1455
Epoch 6/10
2500/2500 ━━━━━━━━━━━━━━━━━━━━ 12s 3ms/step - accuracy: 0.9209 - loss: 0.1855 - val_accuracy: 0.9464 - val_loss: 0.1368
Epoch 7/10
2500/2500 ━━━━━━━━━━━━━━━━━━━━ 10s 3ms/step - accuracy: 0.9199 - loss: 0.1823 - val_accuracy: 0.9473 - val_loss: 0.1285
Epoch 8/10
2500/2500 ━━━━━━━━━━━━━━━━━━━━ 8s 3ms/step - accuracy: 0.9233 - loss: 0.1756 - val_accuracy: 0.9491 - val_loss: 0.1217
Epoch 9/10
2500/2500 ━━━━━━━━━━━━━━━━━━━━ 7s 3ms/step - accuracy: 0.9240 - loss: 0.1727 - val_accuracy: 0.9495 - val_loss: 0.1168
Epoch 10/10
2500/2500 ━━━━━━━━━━━━━━━━━━━━ 7s 3ms/step - accuracy: 0.9232 - loss: 0.1679 - val_accuracy: 0.9491 - val_loss: 0.1151
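To see the convergence at a glance, you can capture the return value of model.fit and plot the recorded curves. This is a sketch; it assumes matplotlib is installed and replaces the bare model.fit call above:
import matplotlib.pyplot as plt

history = model.fit(train_ds, epochs=10, validation_data=val_ds)
# Plot the accuracy and loss curves recorded during training
plt.plot(history.history["accuracy"], label="accuracy")
plt.plot(history.history["val_accuracy"], label="val_accuracy")
plt.plot(history.history["loss"], label="loss")
plt.plot(history.history["val_loss"], label="val_loss")
plt.xlabel("epoch")
plt.legend()
plt.show()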
Below is another implementation. Instead of going through a tf.data Dataset, it feeds pandas DataFrames to the model directly, and it uses scikit-learn for the feature encoding. Its network also has more hidden layers than the version above, and it appears to perform somewhat better.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
# Load the data
file_path = './married.csv' # Replace with path to your CSV file
data = pd.read_csv(file_path)
# Preprocessing
# Encode categorical features ('sex', 'education')
label_encoders = {}
categorical_features = ['sex', 'education']
for feature in categorical_features:
    le = LabelEncoder()
    data[feature] = le.fit_transform(data[feature])
    label_encoders[feature] = le
# Separate features (X) and target (y)
X = data.drop(columns=['married']) # Drop the target column
y = data['married'] # Target column
# Normalize numerical features (age, height, etc.)
scaler = StandardScaler()
X[['age', 'height']] = scaler.fit_transform(X[['age', 'height']])
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Build the model
model = Sequential([
    Dense(64, input_dim=X_train.shape[1], activation='relu'),
    Dropout(0.3),
    Dense(32, activation='relu'),
    Dense(16, activation='relu'),
    Dense(1, activation='sigmoid')  # Output layer for binary classification
])
# Compile the model
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
# Train the model
history = model.fit(X_train, y_train, epochs=20, batch_size=32, validation_split=0.2, verbose=1)
# Evaluate the model
test_loss, test_accuracy = model.evaluate(X_test, y_test, verbose=0)
print(f"Test Accuracy: {test_accuracy:.4f}")
# Make predictions
predictions = model.predict(X_test)
predictions = (predictions > 0.5).astype(int) # Convert probabilities to binary predictions
# Optional: Save the model
model.save('marriage_classifier_model.h5')
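# A sketch of how the saved model could be reloaded later
# (it assumes the marriage_classifier_model.h5 file written above):
# from tensorflow.keras.models import load_model
# model = load_model('marriage_classifier_model.h5')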
# Sample model prediction
sample_input = X_test.iloc[:1] # Take one example from the test set
sample_prediction = model.predict(sample_input)
print(f"Sample Prediction: {sample_prediction[0][0]:.4f} (Married: {sample_prediction[0][0] > 0.5})")
Its training process is shown below. The model converges very well.
Epoch 1/20
2000/2000 ━━━━━━━━━━━━━━━━━━━━ 7s 3ms/step - accuracy: 0.8264 - loss: 0.3772 - val_accuracy: 0.9611 - val_loss: 0.1055
Epoch 2/20
2000/2000 ━━━━━━━━━━━━━━━━━━━━ 6s 3ms/step - accuracy: 0.9434 - loss: 0.1354 - val_accuracy: 0.9790 - val_loss: 0.0641
Epoch 3/20
2000/2000 ━━━━━━━━━━━━━━━━━━━━ 5s 2ms/step - accuracy: 0.9628 - loss: 0.0938 - val_accuracy: 0.9858 - val_loss: 0.0412
Epoch 4/20
2000/2000 ━━━━━━━━━━━━━━━━━━━━ 5s 2ms/step - accuracy: 0.9723 - loss: 0.0717 - val_accuracy: 0.9865 - val_loss: 0.0339
Epoch 5/20
2000/2000 ━━━━━━━━━━━━━━━━━━━━ 5s 3ms/step - accuracy: 0.9751 - loss: 0.0642 - val_accuracy: 0.9919 - val_loss: 0.0277
Epoch 6/20
2000/2000 ━━━━━━━━━━━━━━━━━━━━ 5s 3ms/step - accuracy: 0.9791 - loss: 0.0553 - val_accuracy: 0.9941 - val_loss: 0.0195
Epoch 7/20
2000/2000 ━━━━━━━━━━━━━━━━━━━━ 5s 2ms/step - accuracy: 0.9825 - loss: 0.0472 - val_accuracy: 0.9953 - val_loss: 0.0201
Epoch 8/20
2000/2000 ━━━━━━━━━━━━━━━━━━━━ 5s 3ms/step - accuracy: 0.9829 - loss: 0.0448 - val_accuracy: 0.9983 - val_loss: 0.0125
Epoch 9/20
2000/2000 ━━━━━━━━━━━━━━━━━━━━ 5s 2ms/step - accuracy: 0.9857 - loss: 0.0374 - val_accuracy: 0.9974 - val_loss: 0.0134
Epoch 10/20
2000/2000 ━━━━━━━━━━━━━━━━━━━━ 5s 2ms/step - accuracy: 0.9862 - loss: 0.0376 - val_accuracy: 0.9959 - val_loss: 0.0128
Epoch 11/20
2000/2000 ━━━━━━━━━━━━━━━━━━━━ 5s 2ms/step - accuracy: 0.9871 - loss: 0.0352 - val_accuracy: 0.9966 - val_loss: 0.0110
Epoch 12/20
2000/2000 ━━━━━━━━━━━━━━━━━━━━ 6s 2ms/step - accuracy: 0.9869 - loss: 0.0348 - val_accuracy: 0.9964 - val_loss: 0.0118
Epoch 13/20
2000/2000 ━━━━━━━━━━━━━━━━━━━━ 5s 2ms/step - accuracy: 0.9885 - loss: 0.0320 - val_accuracy: 0.9992 - val_loss: 0.0063
Epoch 14/20
2000/2000 ━━━━━━━━━━━━━━━━━━━━ 5s 2ms/step - accuracy: 0.9898 - loss: 0.0278 - val_accuracy: 0.9989 - val_loss: 0.0081
Epoch 15/20
2000/2000 ━━━━━━━━━━━━━━━━━━━━ 5s 2ms/step - accuracy: 0.9891 - loss: 0.0298 - val_accuracy: 0.9963 - val_loss: 0.0103
Epoch 16/20
2000/2000 ━━━━━━━━━━━━━━━━━━━━ 5s 3ms/step - accuracy: 0.9904 - loss: 0.0263 - val_accuracy: 0.9939 - val_loss: 0.0167
Epoch 17/20
2000/2000 ━━━━━━━━━━━━━━━━━━━━ 10s 2ms/step - accuracy: 0.9901 - loss: 0.0275 - val_accuracy: 0.9962 - val_loss: 0.0098
Epoch 18/20
2000/2000 ━━━━━━━━━━━━━━━━━━━━ 5s 2ms/step - accuracy: 0.9905 - loss: 0.0258 - val_accuracy: 0.9967 - val_loss: 0.0091
Epoch 19/20
2000/2000 ━━━━━━━━━━━━━━━━━━━━ 5s 2ms/step - accuracy: 0.9914 - loss: 0.0223 - val_accuracy: 0.9974 - val_loss: 0.0055
Epoch 20/20
2000/2000 ━━━━━━━━━━━━━━━━━━━━ 5s 3ms/step - accuracy: 0.9927 - loss: 0.0204 - val_accuracy: 0.9986 - val_loss: 0.0058
Test Accuracy: 0.9985
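Accuracy alone can hide problems such as class imbalance. As a quick extra check, the binary predictions computed above can be compared against the test labels with scikit-learn (a sketch; it reuses y_test and predictions from the second program):
from sklearn.metrics import confusion_matrix, classification_report

# Per-class view of the test-set predictions
print(confusion_matrix(y_test, predictions.ravel()))
print(classification_report(y_test, predictions.ravel()))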