Paper: **Computer-Assisted Keyword and Document Set Discovery from Unstructured Text** (2017, King, Lam & Roberts)

Context & Issue

Keywords help choose documents from a large text corpus for further study

But there are issues: humans are unreliable at selecting keywords (different people, and the same person at different times, produce very different lists), so relevant documents get missed

Some definitions

  - **R** (reference set): documents already known to be relevant (e.g., those matching the starting keywords)
  - **S** (search set): the larger corpus we want to search
  - **T** (target set): the documents in S that belong with R; **S \ T** is the rest of S

The Algorithm

  1. Start with high-quality keywords to create the reference set R (Ex: #bostonmarathon)

    1. Rank the keywords within these documents
      1. Using different metrics, e.g. document frequency
      2. This may surface more keywords and expand the reference set (#bostonmarathon + #bombsinboston)
  2. Partition S (search set) into T (target set) and S \ T (non-target set) documents

    1. Train a classifier to predict whether each document belongs in R or S
    2. Training set = random docs sampled from R and from S
    3. The documents in S that the classifier mistakenly labels as R form the estimated target set T, which we explore further (the mistakes happen because those documents have something in common with the reference set)


  3. Find and rank keywords that best discriminate T and S \ T

    1. Identify all unique keywords in S

    2. Sort them into two lists according to which of the two sets, T or S \ T, they better predict, based on document proportions

      1. If a keyword occurs in a larger proportion of S \ T documents, it goes on the "S \ T list"; otherwise on the "T list"
      2. Ex: "bombing" occurs in 6/10 docs (60%) of the T set and in 100/200 docs (50%) of the S \ T set, so "bombing" goes on the T list because its proportion is 10 percentage points higher
    3. Rank the keywords within each list: how well does each keyword discriminate the two sets? Different metrics can be used.

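The three steps above can be sketched in miniature. The toy corpus, the word-level Naive Bayes classifier, the difference-in-proportions ranking metric, and names like `looks_like_ref` are all illustrative choices of mine, not from the paper (which allows any classifier and any ranking metric):

```python
import math
from collections import Counter

def tokens(doc):
    """Unique lowercase words in a document."""
    return set(doc.lower().split())

def train(ref_docs, search_docs):
    """Per-class document frequencies for a word-level Naive Bayes model."""
    df_r, df_s = Counter(), Counter()
    for d in ref_docs:
        df_r.update(tokens(d))
    for d in search_docs:
        df_s.update(tokens(d))
    return df_r, df_s, len(ref_docs), len(search_docs)

def looks_like_ref(doc, model):
    """True if the document scores higher under the R class than under S."""
    df_r, df_s, n_r, n_s = model
    log_r = log_s = 0.0
    for w in tokens(doc):
        # add-one smoothed per-class probability that a doc contains w
        log_r += math.log((df_r[w] + 1) / (n_r + 2))
        log_s += math.log((df_s[w] + 1) / (n_s + 2))
    return log_r > log_s

# Step 1: reference set from a high-quality keyword; search set = the corpus
ref = ["bostonmarathon bombing suspect manhunt",
       "bostonmarathon explosion finish line"]
search = ["bombing investigation boston suspect",
          "red sox game at fenway",
          "marathon training plan for beginners",
          "explosion reported near finish line"]

# Step 2: search docs the classifier mistakes for reference docs become T
model = train(ref, search)
T = [d for d in search if looks_like_ref(d, model)]
non_T = [d for d in search if not looks_like_ref(d, model)]

# Step 3: rank keywords by the difference in document proportions
# between T and S \ T (positive difference -> goes on the "T list")
def proportion(word, docs):
    return sum(word in tokens(d) for d in docs) / max(len(docs), 1)

ranked = sorted({w for d in search for w in tokens(d)},
                key=lambda w: proportion(w, T) - proportion(w, non_T),
                reverse=True)
print(T)       # the two bombing-related docs end up in T
print(ranked)  # T-discriminating words come first
```

On this toy data the classifier "mistakes" the two bombing-related search documents for reference documents, so they form T, and words like "bombing" and "explosion" rank above words like "fenway".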