Classifying Marijuana Abusers with NSDUH
For those who smoke marijuana, can we predict who has marijuana use disorder?
The disorder is, by definition, a problem of usage. As quoted from the National Institutes of Health, "Marijuana use can lead to the development of problem use, known as a marijuana use disorder, which takes the form of addiction in severe cases. Recent data suggest that 30 percent of those who use marijuana may have some degree of marijuana use disorder. Marijuana use disorders are often associated with dependence—in which a person feels withdrawal symptoms when not taking the drug."
The difference between this and addiction can be a little murky. Again from the NIH, "Marijuana use disorder becomes addiction when the person cannot stop using the drug even though it interferes with many aspects of his or her life. Estimates of the number of people addicted to marijuana are controversial, in part because epidemiological studies of substance use often use dependence as a proxy for addiction even though it is possible to be dependent without being addicted." In short, it becomes an addiction once the drug becomes knowingly detrimental to one's way of life, but this also leaves the door open for the possibility of being a "functional addict."
According to the DSM-IV (the bible of mental disorders), addiction or "substance dependence" is characterized by the following symptoms:
- Tolerance: the same amount achieves diminished effects.
- Withdrawal: physiological dependence when not taking the drug.
- Impairment: the "regular" way of life is disrupted.
- Inability to quit, even when use is knowingly causing harm.
- Abuse: taking more of the drug, over a longer period, than intended.
Cleaning cryptic data
The survey that I used consisted of 55,271 total participants and 3,148 survey questions, many of which were unrelated to marijuana or addiction. A good example of how data can be messy: the dataset had both encoded labels and encoded responses. For example, a survey question about the age of first marijuana usage is labeled MJAGE, where the responses are encoded as 1 for "before 14 years old," 2 for "between 15-17," etc. Some of the encoded responses were also non-sequential; for example, "decline to answer" is 9999. Playing with this dataset really showed me how important it is to take some time to fully understand the ins and outs of your dataset before all else.
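To make the decoding step concrete, here is a minimal sketch of how an encoded NSDUH column can be translated into readable categories with pandas. MJAGE is the real column name from the survey, but the sample rows and the response map below are illustrative, not the survey's full codebook.

```python
import pandas as pd
import numpy as np

# Hypothetical slice of the raw survey file -- real NSDUH data has
# thousands of columns and tens of thousands of rows.
raw = pd.DataFrame({"MJAGE": [1, 2, 9999, 2, 1]})

# Map encoded responses to readable categories, and treat the
# non-sequential sentinel 9999 ("decline to answer") as missing data.
mjage_map = {1: "before 14", 2: "15-17"}
raw["MJAGE_decoded"] = raw["MJAGE"].replace(9999, np.nan).map(mjage_map)

print(raw["MJAGE_decoded"].tolist())
```

Replacing sentinel codes like 9999 with `NaN` before mapping keeps "decline to answer" from being silently treated as a huge numeric value downstream.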
Feature engineering (The creative part!)
Without getting overly detailed, I wanted to explore features beyond those directly related to marijuana usage, involving things like substance-abuse history (of alcohol) and family background information. After digesting the dataset, cleaning it, and doing some feature engineering, I ended up with fifteen features:
I then filtered for only participants who marked down having been exposed to marijuana sometime in the past (since you can't be evaluated as being dependent on a drug you've never tried before). This took my number of rows down to 7,316 participants.
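The filtering step is a one-liner in pandas. The column name `ever_used_mj` below is a hypothetical engineered flag standing in for the relevant survey response, and the rows are made up for illustration:

```python
import pandas as pd

# Illustrative frame: "ever_used_mj" is a hypothetical flag, not an
# actual NSDUH column name.
df = pd.DataFrame({
    "ever_used_mj": [1, 0, 1, 0, 1],
    "age_first_use": [16, None, 14, None, 19],
})

# Keep only participants with any lifetime marijuana exposure --
# dependence can only be evaluated for people who have tried the drug.
users = df[df["ever_used_mj"] == 1].reset_index(drop=True)
print(len(users))  # 3 of the 5 illustrative rows
```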
For my target variable, I used a combination of different survey questions that matched closely with the DSM-IV definition of "addiction." As a baseline, I found about 2,150 out of the 7,316 participants to fall under marijuana use disorder (roughly equal to the 30% estimate observed in the population!).
Splitting data into a training and testing set
Because data is so precious, there needs to be an effective way of splitting it into one group that you can train your model on and another group against which you can test the goodness of the trained model. We often hear about the "70-30 split," meaning train with 70 percent of the data and validate with the remaining 30 percent. In my case, though, I opted for stratified k-fold cross-validation for the sole purpose of conserving as many of the valuable "addicted" target labels as possible, since the significant majority of participants are marijuana addiction-symptom free.
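Stratified k-fold splitting is available out of the box in scikit-learn. The sketch below uses toy 70/30 labels standing in for the imbalanced "addicted" flag, and shows that each validation fold preserves the class ratio:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced labels standing in for the ~30%-positive target.
y = np.array([0] * 70 + [1] * 30)
X = np.arange(100).reshape(-1, 1)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in skf.split(X, y):
    # Every fold keeps the 70/30 class ratio, so the minority
    # "addicted" label is represented in each validation fold.
    print(np.bincount(y[test_idx]))
```

With a plain (unstratified) split, an unlucky fold could end up with very few positive labels, which is exactly what stratification guards against.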
To build a classifier model that can predict marijuana users as either abusers or non-abusers, I used Random Forest and Gradient Boosting. After experimenting with hyper-parameter tuning, I then plotted a precision-recall curve to compare the "goodness" of my models.
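A minimal sketch of that workflow, using a synthetic imbalanced dataset in place of the cleaned NSDUH features (the real model used the fifteen engineered features described above, and real hyper-parameter tuning would involve a grid or random search):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in: ~70/30 class balance, fifteen features.
X, y = make_classification(n_samples=1000, n_features=15,
                           weights=[0.7], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

for model in (RandomForestClassifier(n_estimators=100, random_state=42),
              GradientBoostingClassifier(random_state=42)):
    model.fit(X_tr, y_tr)
    probs = model.predict_proba(X_te)[:, 1]
    # precision/recall arrays are what get plotted as the PR curve;
    # average precision summarizes the curve in one number.
    precision, recall, _ = precision_recall_curve(y_te, probs)
    ap = average_precision_score(y_te, probs)
    print(type(model).__name__, round(ap, 3))
```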
Precision vs. recall
Increasing recall means catching a larger share of the true addicts, but at the cost of also flagging people who are not necessarily addicts (yet) but merely have the potential, because the model says yes more liberally and misclassifies people nonetheless. Increasing precision means that the people the model does flag are more likely to genuinely be addicts, at the cost of letting some true addicts slip through.
Finding threshold via f1-score
Since there really is a tradeoff between precision and recall, the f1-score instead focuses on the harmonic mean of the two. To find the optimal threshold value (0.266), I plotted the f1-score across a range of threshold values and selected the threshold that yielded the highest score.
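The threshold sweep can be sketched in a few lines. The probabilities below are randomly generated for illustration; in the project they would come from the tuned classifier's `predict_proba` output, and the best threshold there landed at 0.266:

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical true labels and predicted probabilities, built so that
# positives tend to score higher than negatives.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)
y_prob = np.clip(y_true * 0.4 + rng.random(500) * 0.6, 0, 1)

# Sweep candidate thresholds and keep the one with the highest f1-score.
thresholds = np.linspace(0.05, 0.95, 91)
f1s = [f1_score(y_true, (y_prob >= t).astype(int)) for t in thresholds]
best = thresholds[int(np.argmax(f1s))]
print(round(best, 3), round(max(f1s), 3))
```

Because f1 is the harmonic mean of precision and recall, maximizing it picks a threshold where neither metric is allowed to collapse.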
Which features weighed most heavily in my model's prediction algorithm of classifying abusers and non-abusers?
As seen from the graph above, there is a huge disparity between the top two features and the rest; interestingly enough, those two were the total days the participant used marijuana in the past year and his/her age.
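For tree-based models like the ones used here, that ranking comes straight from the fitted model's `feature_importances_` attribute. The sketch below uses synthetic data and hypothetical feature names (only the first two echo the post's top features), so the printed ordering will not match the real results:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature names for a synthetic four-feature dataset.
names = ["days_used_past_year", "age", "alcohol_history", "family_background"]
X, y = make_classification(n_samples=500, n_features=4, n_informative=2,
                           random_state=1)
clf = RandomForestClassifier(n_estimators=200, random_state=1).fit(X, y)

# Sort so the most predictive features print first; importances sum to 1.
order = np.argsort(clf.feature_importances_)[::-1]
for i in order:
    print(names[i], round(clf.feature_importances_[i], 3))
```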
I think there is still plenty of work that can be done on this massive dataset. For one, aggregating additional survey years would increase the sample size. It would also be interesting to try different combinations of features, and possibly even a non-tree-based algorithm, to improve the power of the model to predict marijuana use disorder.