Paper: **Computer-Assisted Keyword and Document Set Discovery from Unstructured Text** (2017, King, Lam & Roberts)
Context & Issue
Keywords help choose documents from a large text corpus for further study
But there are issues
- Humans are unreliable at selecting keywords; finding keywords by reading large numbers of documents is not feasible
- Conversational drift often occurs: e.g., #bostonmarathon → #PrayforBoston
- Selection bias may occur: e.g., the choice of keyword list may skew the sentiment of the document set chosen, leading to the (wrong) conclusion that social media discourse was extremely negative (❌)

Some definitions
- A hybrid “unsupervised” approach = computer suggestions + human evaluation
- Define the “reference set” R = a set of documents exemplifying a single chosen concept of interest (e.g., a topic, sentiment, idea, person, organization, or event)
- Define the “search set” S = relevant + irrelevant documents (R ∩ S = ∅)
- Goal = identify a “target set” T in S using K_T = the keywords that select the target-set documents (the set relationships are sketched after this list)
- Human input crafts the query Q_T (from suggested keywords) to extract more keywords
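
To make the set relationships concrete, here is a minimal Python sketch. The document IDs and the particular target set are invented for illustration; only the relationships R ∩ S = ∅ and T ⊆ S come from the definitions above.

```python
# Minimal sketch of the set relationships (document IDs are hypothetical).
R = {"doc_01", "doc_02", "doc_03"}             # reference set: exemplars of the concept
S = {"doc_10", "doc_11", "doc_12", "doc_13"}   # search set: relevant + irrelevant docs

assert R.isdisjoint(S)   # R ∩ S = ∅ by construction

# T ⊆ S is unknown up front; the algorithm estimates it and derives the
# keywords K_T that would retrieve exactly those documents.
T = {"doc_10", "doc_12"}   # placeholder estimate of the target set
assert T <= S              # T is a subset of the search set
```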

The Algorithm
- Start with high-quality keywords to create the reference set (e.g., #bostonmarathon)
    - Rank keywords within these documents
        - Using different metrics, like document frequency (see the sketch after this step)
    - This may help collect more keywords and expand the reference set (#bostonmarathon + #bombsinboston)
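
Below is a toy sketch of the document-frequency ranking inside the reference set. The example documents and the `document_frequency` helper are made up; document frequency is only one of several possible ranking metrics.

```python
from collections import Counter

# Hypothetical reference-set documents (stand-ins for #bostonmarathon tweets).
reference_docs = [
    "prayers for everyone at the #bostonmarathon",
    "explosions near the #bostonmarathon finish line",
    "#bostonmarathon finish line explosion stay safe boston",
]

def document_frequency(docs):
    """Count how many documents each token appears in (not total occurrences)."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))
    return df

# High-document-frequency tokens are candidates for expanding the keyword list.
for word, count in document_frequency(reference_docs).most_common(5):
    print(word, count)
```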
- Partition S (search set) into T (target set) and S \ T (non-target set) documents
    - Train a classifier to predict whether each document belongs to R or S
        - Training set = a random sample of docs from R and S
    - Apply it to S: the “mistakes” (search-set documents classified as belonging to R) are the estimated target set T we explore further, because they seem to have something in common with the reference set (see the classifier sketch after this step)
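
A minimal sketch of the partitioning step, assuming scikit-learn is available. The mini-corpora are invented, and a single Naive Bayes model is used purely for illustration (the method is not tied to this particular classifier). Search-set documents that score as “reference-like” are the candidates for the estimated target set T.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical mini-corpora standing in for the reference set R and search set S.
R_docs = [
    "explosion at the #bostonmarathon finish line",
    "prayers for #bostonmarathon runners",
]
S_docs = [
    "breaking explosion reported near the marathon finish line",
    "my marathon training plan for the fall",
    "bruins game tonight downtown",
]

# Train a classifier to separate R from S.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(R_docs + S_docs)
y = [1] * len(R_docs) + [0] * len(S_docs)  # 1 = reference set, 0 = search set
classifier = MultinomialNB().fit(X, y)

# Score every search-set document; the ones the classifier "mistakes" for
# reference-set documents (high P(reference-like)) form the estimated target set T.
probs = classifier.predict_proba(vectorizer.transform(S_docs))[:, 1]
for doc, p in sorted(zip(S_docs, probs), key=lambda pair: -pair[1]):
    print(f"P(reference-like) = {p:.2f}  {doc}")
```

Ranking by probability rather than a hard 0/1 prediction makes the “mistakes” easier to inspect and threshold by hand.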

- Find and rank keywords that best discriminate T and S \ T
    - Identify all unique keywords in S
    - Sort them into those that predict T and those that predict S \ T, based on document proportions
        - If a keyword occurs proportionally more often in S \ T, we put it into the “S \ T list”
        - Ex: “bombing” occurs in 6/10 docs (60%) in the T document set and in 100/200 docs (50%) in the S \ T documents → we put “bombing” in the T list because its proportion is 10 percentage points higher
    - Rank the keywords within each list: how well do they discriminate the two sets? You can use different metrics (a sketch follows)
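
A sketch of the sorting-and-ranking step with hypothetical counts, using a simple difference in document proportions as the score; this is only one illustrative metric among those the keywords could be ranked by, and the function and variable names are mine.

```python
def sort_and_rank(keyword_counts, n_target, n_nontarget):
    """keyword_counts: {keyword: (docs containing it in T, docs containing it in S \\ T)}."""
    t_list, nontarget_list = [], []
    for kw, (c_t, c_nt) in keyword_counts.items():
        # Difference in document proportions between T and S \ T.
        gap = c_t / n_target - c_nt / n_nontarget
        # Assign the keyword to whichever set it occurs in proportionally more often,
        # and score it by the size of that gap.
        (t_list if gap >= 0 else nontarget_list).append((kw, abs(gap)))
    # Rank each list so the most discriminating keywords come first.
    return (sorted(t_list, key=lambda x: -x[1]),
            sorted(nontarget_list, key=lambda x: -x[1]))

# The "bombing" example from the notes: 6/10 target docs (60%) vs 100/200
# non-target docs (50%), so it lands on the T list with a gap of about 0.10.
counts = {"bombing": (6, 100), "marathon": (9, 180), "weather": (1, 60)}
t_keywords, nontarget_keywords = sort_and_rank(counts, n_target=10, n_nontarget=200)
print(t_keywords)          # "bombing" tops the T list
print(nontarget_keywords)  # "weather" leans toward S \ T
```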
