Paper: **Computer-Assisted Keyword and Document Set Discovery from Unstructured Text** (2017, King, Lam & Roberts)

Context & Issue

Keywords help choose documents from a large text corpus for further study

But there are issues: humans are unreliable at selecting keywords (different people, and the same person at different times, produce very different lists), so relevant documents get missed

Some definitions

  - **R** (reference set): documents already known to be relevant (e.g., those matching the starting keywords)
  - **S** (search set): the larger corpus we want to search
  - **T** (target set): the documents in S that belong with R; **S \ T** is the rest of S

The Algorithm

  1. Start with high-quality keywords to create the reference set R (Ex: #bostonmarathon)

    1. Rank the keywords within these documents
      1. Using different metrics, e.g. document frequency
      2. This may surface more keywords and expand the reference set (#bostonmarathon + #bombsinboston)
  2. Partition S (search set) into T (target set) and S \ T (non-target set) documents

    1. Train a classifier to predict whether each document belongs in R or S
    2. Training set = random docs sampled from R and from S
    3. The documents in S that the classifier mistakenly labels as R form the estimated target set T, which we explore further (the mistakes happen because those documents have something in common with the reference set)


  3. Find and rank keywords that best discriminate T and S \ T

    1. Identify all unique keywords in S

    2. Sort them into two lists according to which of the two sets, T or S \ T, they better predict, based on document proportions

      1. If a keyword occurs in a larger proportion of S \ T documents, it goes on the "S \ T list"; otherwise on the "T list"
      2. Ex: "bombing" occurs in 6/10 docs (60%) of the T set and in 100/200 docs (50%) of the S \ T set, so "bombing" goes on the T list because its proportion is 10 percentage points higher
    3. Rank the keywords within each list: how well does each keyword discriminate the two sets? Different metrics can be used.

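The three steps above can be sketched in miniature. The toy corpus, the word-level Naive Bayes classifier, the difference-in-proportions ranking metric, and names like `looks_like_ref` are all illustrative choices of mine, not from the paper (which allows any classifier and any ranking metric):

```python
import math
from collections import Counter

def tokens(doc):
    """Unique lowercase words in a document."""
    return set(doc.lower().split())

def train(ref_docs, search_docs):
    """Per-class document frequencies for a word-level Naive Bayes model."""
    df_r, df_s = Counter(), Counter()
    for d in ref_docs:
        df_r.update(tokens(d))
    for d in search_docs:
        df_s.update(tokens(d))
    return df_r, df_s, len(ref_docs), len(search_docs)

def looks_like_ref(doc, model):
    """True if the document scores higher under the R class than under S."""
    df_r, df_s, n_r, n_s = model
    log_r = log_s = 0.0
    for w in tokens(doc):
        # add-one smoothed per-class probability that a doc contains w
        log_r += math.log((df_r[w] + 1) / (n_r + 2))
        log_s += math.log((df_s[w] + 1) / (n_s + 2))
    return log_r > log_s

# Step 1: reference set from a high-quality keyword; search set = the corpus
ref = ["bostonmarathon bombing suspect manhunt",
       "bostonmarathon explosion finish line"]
search = ["bombing investigation boston suspect",
          "red sox game at fenway",
          "marathon training plan for beginners",
          "explosion reported near finish line"]

# Step 2: search docs the classifier mistakes for reference docs become T
model = train(ref, search)
T = [d for d in search if looks_like_ref(d, model)]
non_T = [d for d in search if not looks_like_ref(d, model)]

# Step 3: rank keywords by the difference in document proportions
# between T and S \ T (positive difference -> goes on the "T list")
def proportion(word, docs):
    return sum(word in tokens(d) for d in docs) / max(len(docs), 1)

ranked = sorted({w for d in search for w in tokens(d)},
                key=lambda w: proportion(w, T) - proportion(w, non_T),
                reverse=True)
print(T)       # the two bombing-related docs end up in T
print(ranked)  # T-discriminating words come first
```

On this toy data the classifier "mistakes" the two bombing-related search documents for reference documents, so they form T, and words like "bombing" and "explosion" rank above words like "fenway".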