Predicting whether a person is married with TensorFlow

February 11, 2025, by unix2go

Building on the earlier tutorials, we can write a TensorFlow program that predicts whether a person is married from a set of input features.

Note that this program does not use the logistic regression algorithm from classical machine learning for the classification. Instead, it builds a custom neural network (a DNN), so what it actually draws on is deep learning.
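For comparison, here is what a logistic-regression baseline might look like. This is a minimal sketch, not part of the original post; the file name married.csv and the column names are assumptions taken from the code further down.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Assumed file and column names; they mirror the TensorFlow code below.
data = pd.read_csv('./married.csv')
for col in ['sex', 'education']:
    data[col] = LabelEncoder().fit_transform(data[col])

X = data.drop(columns=['married'])
y = data['married']
X[['age', 'height']] = StandardScaler().fit_transform(X[['age', 'height']])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"Baseline accuracy: {clf.score(X_test, y_test):.4f}")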

The training data has 100,000 rows; it can be downloaded from this link.
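The post does not spell out the CSV layout, but judging from the code below it should contain the columns age, sex, height, education, has_car, has_house, plus the label married. The header and row here are a hypothetical illustration only; the column order is an assumption, and the row is not real data.

age,sex,height,education,has_car,has_house,married
30,female,165,doctor,0,0,1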

The TensorFlow program follows. Its tricky part is feature engineering, i.e., how to turn string and numeric variables into the corresponding TensorFlow tensors. Once the features are encoded, building, training, evaluating, and running the model takes only a few lines of code and is very straightforward.

import os

# TensorFlow is the only backend that supports string inputs.
os.environ["KERAS_BACKEND"] = "tensorflow"

import tensorflow as tf
import pandas as pd
import keras
from keras import layers

dataframe = pd.read_csv('./married.csv')

val_dataframe = dataframe.sample(frac=0.2, random_state=1337)
train_dataframe = dataframe.drop(val_dataframe.index)

print(
    f"Using {len(train_dataframe)} samples for training "
    f"and {len(val_dataframe)} for validation"
)

def dataframe_to_dataset(dataframe):
    dataframe = dataframe.copy()
    labels = dataframe.pop("married")
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
    ds = ds.shuffle(buffer_size=len(dataframe))
    return ds


train_ds = dataframe_to_dataset(train_dataframe)
val_ds = dataframe_to_dataset(val_dataframe)

#for x, y in train_ds.take(1):
#    print("Input:", x)
#    print("Target:", y)

train_ds = train_ds.batch(32)
val_ds = val_ds.batch(32)

def encode_numerical_feature(feature, name, dataset):
    # Create a Normalization layer for our feature
    normalizer = layers.Normalization()

    # Prepare a Dataset that only yields our feature
    feature_ds = dataset.map(lambda x, y: x[name])
    feature_ds = feature_ds.map(lambda x: tf.expand_dims(x, -1))

    # Learn the statistics of the data
    normalizer.adapt(feature_ds)

    # Normalize the input feature
    encoded_feature = normalizer(feature)
    return encoded_feature


def encode_categorical_feature(feature, name, dataset, is_string):
    lookup_class = layers.StringLookup if is_string else layers.IntegerLookup
    # Create a lookup layer which will turn strings into integer indices
    lookup = lookup_class(output_mode="binary")

    # Prepare a Dataset that only yields our feature
    feature_ds = dataset.map(lambda x, y: x[name])
    feature_ds = feature_ds.map(lambda x: tf.expand_dims(x, -1))

    # Learn the set of possible string values and assign them a fixed integer index
    lookup.adapt(feature_ds)

    # Turn the string input into integer indices
    encoded_feature = lookup(feature)
    return encoded_feature

# Categorical features encoded as integers
has_car = keras.Input(shape=(1,), name="has_car", dtype="int64")
has_house = keras.Input(shape=(1,), name="has_house", dtype="int64")

# Categorical feature encoded as string
sex = keras.Input(shape=(1,), name="sex", dtype="string")
education = keras.Input(shape=(1,), name="education", dtype="string")

# Numerical features
age = keras.Input(shape=(1,), name="age")
height = keras.Input(shape=(1,), name="height")

all_inputs = [
    age,
    sex,
    height,
    education,
    has_car,
    has_house,
]

# Integer categorical features
has_car_encoded = encode_categorical_feature(has_car, "has_car", train_ds, False)
has_house_encoded = encode_categorical_feature(has_house, "has_house", train_ds, False)

# String categorical features
sex_encoded = encode_categorical_feature(sex, "sex", train_ds, True)
education_encoded = encode_categorical_feature(education, "education", train_ds, True)

# Numerical features
age_encoded = encode_numerical_feature(age, "age", train_ds)
height_encoded = encode_numerical_feature(height, "height", train_ds)

all_features = layers.concatenate(
    [
        age_encoded,
        sex_encoded,
        height_encoded,
        education_encoded,
        has_car_encoded,
        has_house_encoded,
    ]
)

# build the model
x = layers.Dense(16, activation="relu")(all_features)
x = layers.Dropout(0.2)(x)
output = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(all_inputs, output)
model.compile("adam", "binary_crossentropy", metrics=["accuracy"])

# `rankdir='LR'` makes the graph horizontal.
keras.utils.plot_model(model, show_shapes=True, rankdir="LR")

# training
model.fit(train_ds, epochs=10, validation_data=val_ds)


# prediction
sample = {
    "age": 30,
    "sex": "female",
    "height": 165,
    "education": "doctor",
    "has_car": 0,
    "has_house": 0,
}

input_dict = {name: tf.convert_to_tensor([value]) for name, value in sample.items()}
predictions = model.predict(input_dict)

print(
    f"This person had a {100 * predictions[0][0]:.1f} "
    "percent probability of getting married."
)

The training log is shown below. Accuracy rises steadily and loss falls steadily, which shows that the model converged properly. If accuracy swings back and forth instead, the model is failing to converge, and in most cases that points to a problem in the training data. A small sketch for plotting these curves follows the log.

Epoch 1/10
2500/2500 ━━━━━━━━━━━━━━━━━━━━ 9s 3ms/step - accuracy: 0.7725 - loss: 0.4526 - val_accuracy: 0.8997 - val_loss: 0.2466
Epoch 2/10
2500/2500 ━━━━━━━━━━━━━━━━━━━━ 11s 3ms/step - accuracy: 0.8896 - loss: 0.2586 - val_accuracy: 0.9204 - val_loss: 0.2023
Epoch 3/10
2500/2500 ━━━━━━━━━━━━━━━━━━━━ 7s 3ms/step - accuracy: 0.9025 - loss: 0.2275 - val_accuracy: 0.9327 - val_loss: 0.1758
Epoch 4/10
2500/2500 ━━━━━━━━━━━━━━━━━━━━ 11s 3ms/step - accuracy: 0.9114 - loss: 0.2087 - val_accuracy: 0.9379 - val_loss: 0.1595
Epoch 5/10
2500/2500 ━━━━━━━━━━━━━━━━━━━━ 9s 3ms/step - accuracy: 0.9144 - loss: 0.1984 - val_accuracy: 0.9455 - val_loss: 0.1455
Epoch 6/10
2500/2500 ━━━━━━━━━━━━━━━━━━━━ 12s 3ms/step - accuracy: 0.9209 - loss: 0.1855 - val_accuracy: 0.9464 - val_loss: 0.1368
Epoch 7/10
2500/2500 ━━━━━━━━━━━━━━━━━━━━ 10s 3ms/step - accuracy: 0.9199 - loss: 0.1823 - val_accuracy: 0.9473 - val_loss: 0.1285
Epoch 8/10
2500/2500 ━━━━━━━━━━━━━━━━━━━━ 8s 3ms/step - accuracy: 0.9233 - loss: 0.1756 - val_accuracy: 0.9491 - val_loss: 0.1217
Epoch 9/10
2500/2500 ━━━━━━━━━━━━━━━━━━━━ 7s 3ms/step - accuracy: 0.9240 - loss: 0.1727 - val_accuracy: 0.9495 - val_loss: 0.1168
Epoch 10/10
2500/2500 ━━━━━━━━━━━━━━━━━━━━ 7s 3ms/step - accuracy: 0.9232 - loss: 0.1679 - val_accuracy: 0.9491 - val_loss: 0.1151
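To eyeball convergence rather than read the raw log, the History object returned by model.fit() can be plotted. A minimal sketch, assuming matplotlib is installed; it reuses the model, train_ds, and val_ds defined above, this time keeping the return value of fit():

import matplotlib.pyplot as plt

# Re-run training, keeping the History object that fit() returns.
history = model.fit(train_ds, epochs=10, validation_data=val_ds)

plt.plot(history.history['accuracy'], label='accuracy')
plt.plot(history.history['val_accuracy'], label='val_accuracy')
plt.plot(history.history['loss'], label='loss')
plt.plot(history.history['val_loss'], label='val_loss')
plt.xlabel('epoch')
plt.legend()
plt.show()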

Below is an alternative implementation. It does not build a tf.data Dataset; it feeds the pandas DataFrame to the model directly and uses scikit-learn for the feature encoding. Its network also adds hidden layers compared with the version above, and it appears to perform somewhat better.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

# Load the data
file_path = './married.csv'  # Replace with path to your CSV file
data = pd.read_csv(file_path)

# Preprocessing
# Encode categorical features ('sex', 'education')
label_encoders = {}
categorical_features = ['sex', 'education']

for feature in categorical_features:
    le = LabelEncoder()
    data[feature] = le.fit_transform(data[feature])
    label_encoders[feature] = le

# Separate features (X) and target (y)
X = data.drop(columns=['married'])  # Drop the target column
y = data['married']  # Target column

# Normalize numerical features (age, height, etc.)
scaler = StandardScaler()
X[['age', 'height']] = scaler.fit_transform(X[['age', 'height']])

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Build the model
model = Sequential([
    Dense(64, input_dim=X_train.shape[1], activation='relu'),
    Dropout(0.3),
    Dense(32, activation='relu'),
    Dense(16, activation='relu'),
    Dense(1, activation='sigmoid')  # Output layer for binary classification
])

# Compile the model
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Train the model
history = model.fit(X_train, y_train, epochs=20, batch_size=32, validation_split=0.2, verbose=1)

# Evaluate the model
test_loss, test_accuracy = model.evaluate(X_test, y_test, verbose=0)
print(f"Test Accuracy: {test_accuracy:.4f}")

# Make predictions
predictions = model.predict(X_test)
predictions = (predictions > 0.5).astype(int)  # Convert probabilities to binary predictions

# Optional: Save the model
model.save('marriage_classifier_model.h5')

# Sample model prediction
sample_input = X_test.iloc[:1]  # Take one example from the test set
sample_prediction = model.predict(sample_input)
print(f"Sample Prediction: {sample_prediction[0][0]:.4f} (Married: {sample_prediction[0][0] > 0.5})")

Its training log is shown below. The model can be seen to converge very well.

Epoch 1/20
2000/2000 ━━━━━━━━━━━━━━━━━━━━ 7s 3ms/step - accuracy: 0.8264 - loss: 0.3772 - val_accuracy: 0.9611 - val_loss: 0.1055
Epoch 2/20
2000/2000 ━━━━━━━━━━━━━━━━━━━━ 6s 3ms/step - accuracy: 0.9434 - loss: 0.1354 - val_accuracy: 0.9790 - val_loss: 0.0641
Epoch 3/20
2000/2000 ━━━━━━━━━━━━━━━━━━━━ 5s 2ms/step - accuracy: 0.9628 - loss: 0.0938 - val_accuracy: 0.9858 - val_loss: 0.0412
Epoch 4/20
2000/2000 ━━━━━━━━━━━━━━━━━━━━ 5s 2ms/step - accuracy: 0.9723 - loss: 0.0717 - val_accuracy: 0.9865 - val_loss: 0.0339
Epoch 5/20
2000/2000 ━━━━━━━━━━━━━━━━━━━━ 5s 3ms/step - accuracy: 0.9751 - loss: 0.0642 - val_accuracy: 0.9919 - val_loss: 0.0277
Epoch 6/20
2000/2000 ━━━━━━━━━━━━━━━━━━━━ 5s 3ms/step - accuracy: 0.9791 - loss: 0.0553 - val_accuracy: 0.9941 - val_loss: 0.0195
Epoch 7/20
2000/2000 ━━━━━━━━━━━━━━━━━━━━ 5s 2ms/step - accuracy: 0.9825 - loss: 0.0472 - val_accuracy: 0.9953 - val_loss: 0.0201
Epoch 8/20
2000/2000 ━━━━━━━━━━━━━━━━━━━━ 5s 3ms/step - accuracy: 0.9829 - loss: 0.0448 - val_accuracy: 0.9983 - val_loss: 0.0125
Epoch 9/20
2000/2000 ━━━━━━━━━━━━━━━━━━━━ 5s 2ms/step - accuracy: 0.9857 - loss: 0.0374 - val_accuracy: 0.9974 - val_loss: 0.0134
Epoch 10/20
2000/2000 ━━━━━━━━━━━━━━━━━━━━ 5s 2ms/step - accuracy: 0.9862 - loss: 0.0376 - val_accuracy: 0.9959 - val_loss: 0.0128
Epoch 11/20
2000/2000 ━━━━━━━━━━━━━━━━━━━━ 5s 2ms/step - accuracy: 0.9871 - loss: 0.0352 - val_accuracy: 0.9966 - val_loss: 0.0110
Epoch 12/20
2000/2000 ━━━━━━━━━━━━━━━━━━━━ 6s 2ms/step - accuracy: 0.9869 - loss: 0.0348 - val_accuracy: 0.9964 - val_loss: 0.0118
Epoch 13/20
2000/2000 ━━━━━━━━━━━━━━━━━━━━ 5s 2ms/step - accuracy: 0.9885 - loss: 0.0320 - val_accuracy: 0.9992 - val_loss: 0.0063
Epoch 14/20
2000/2000 ━━━━━━━━━━━━━━━━━━━━ 5s 2ms/step - accuracy: 0.9898 - loss: 0.0278 - val_accuracy: 0.9989 - val_loss: 0.0081
Epoch 15/20
2000/2000 ━━━━━━━━━━━━━━━━━━━━ 5s 2ms/step - accuracy: 0.9891 - loss: 0.0298 - val_accuracy: 0.9963 - val_loss: 0.0103
Epoch 16/20
2000/2000 ━━━━━━━━━━━━━━━━━━━━ 5s 3ms/step - accuracy: 0.9904 - loss: 0.0263 - val_accuracy: 0.9939 - val_loss: 0.0167
Epoch 17/20
2000/2000 ━━━━━━━━━━━━━━━━━━━━ 10s 2ms/step - accuracy: 0.9901 - loss: 0.0275 - val_accuracy: 0.9962 - val_loss: 0.0098
Epoch 18/20
2000/2000 ━━━━━━━━━━━━━━━━━━━━ 5s 2ms/step - accuracy: 0.9905 - loss: 0.0258 - val_accuracy: 0.9967 - val_loss: 0.0091
Epoch 19/20
2000/2000 ━━━━━━━━━━━━━━━━━━━━ 5s 2ms/step - accuracy: 0.9914 - loss: 0.0223 - val_accuracy: 0.9974 - val_loss: 0.0055
Epoch 20/20
2000/2000 ━━━━━━━━━━━━━━━━━━━━ 5s 3ms/step - accuracy: 0.9927 - loss: 0.0204 - val_accuracy: 0.9986 - val_loss: 0.0058
Test Accuracy: 0.9985