python - What is the right way to analyse a text for adult content identification?


I want to filter out adult-content tweets (or any other text).

For spam detection, there are datasets that label whether a particular text is spam or ham.

For adult content, I found a dataset that I want to use (extract below):

arrbad = ['acrotomophilia', 'anal', 'anilingus', 'anus', ... etc. ..., 'zoophilia']

Question

How can I use this dataset to filter text instances?
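
For reference, the most direct way to use such a list is a per-token lookup. The sketch below is only an illustration: `ARR_BAD` is a hypothetical three-word stand-in for the full list (which is elided above), and `tokenize`, `blacklist_hits` and `is_flagged` are made-up helper names, not part of any library.

```python
import re

# Hypothetical stand-in for the full word list, which is elided above.
ARR_BAD = {"anilingus", "anus", "sexy"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def blacklist_hits(text, blacklist=ARR_BAD):
    """Return the blacklisted tokens found in the text, in order of appearance."""
    return [tok for tok in tokenize(text) if tok in blacklist]


def is_flagged(text, blacklist=ARR_BAD):
    """Flag a text as soon as it contains at least one blacklisted token."""
    return bool(blacklist_hits(text, blacklist))


tweets = [
    "Reading a paper on data analysis tonight",
    "That is one sexy new laptop design",
]
flags = [is_flagged(t) for t in tweets]
```

Matching whole tokens avoids accidental substring hits inside longer words, but the second tweet is still flagged even though it is harmless, because a single word carries no context.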

I would approach this as a text classification problem, because blacklists of words typically do not work well for classifying full texts. The main reason blacklists don't work is that they produce a lot of false positives (one example: your list contains the word 'sexy', which alone isn't enough to flag a document as being for adults). For classification you need a training set with some documents tagged as "adult content" and others tagged as "safe for work". Here is what I would do:

  1. Check whether an existing labelled dataset can be used. You need several thousand documents of each class.
  2. If you don't find any, create one. For instance, you can write a scraper and download Reddit content. See, for instance, "Text classification of NSFW Reddit posts".
  3. Build a text classifier with NLTK. If you don't know how, read "Learning to Classify Text".
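
As a rough illustration of step 3, here is a minimal sketch using NLTK's `NaiveBayesClassifier` with bag-of-words features. The eight labelled sentences are made up for the example and are far too few for real use; the `features` and `classify` helpers and the label names `"adult"` and `"safe"` are assumptions, not something prescribed by NLTK.

```python
import re

import nltk

# Tiny made-up training set, for illustration only. A real one needs
# several thousand labelled documents per class.
LABELLED = [
    ("explicit adult video for mature viewers only", "adult"),
    ("uncensored explicit photos adults only", "adult"),
    ("mature explicit content not safe for work", "adult"),
    ("adults only explicit chat room", "adult"),
    ("quarterly report attached for the team meeting", "safe"),
    ("new laptop design looks great", "safe"),
    ("team lunch at noon then project meeting", "safe"),
    ("weather report sunny with light wind", "safe"),
]


def features(text):
    """Bag-of-words features: one boolean entry per token present in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return {f"contains({tok})": True for tok in tokens}


# NLTK expects a list of (feature_dict, label) pairs.
train_set = [(features(text), label) for text, label in LABELLED]
classifier = nltk.NaiveBayesClassifier.train(train_set)


def classify(text):
    """Return the most probable label for a new text."""
    return classifier.classify(features(text))


prediction_adult = classify("explicit photos for adults only")
prediction_safe = classify("project report for the team meeting")
```

Unlike a blacklist, the classifier weighs every word in the document against both classes, so a single suggestive word does not decide the label on its own. With a real dataset, hold out part of the labelled documents and measure performance with `nltk.classify.accuracy` before trusting the predictions.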
