
NLP Exploration: Topic Models

While Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) are the most popular topic models, there are many other models and extensions to LDA that can be used to extract more information from text data.

Getting Started

Topic models are a type of unsupervised machine learning for text data that finds topics in a group of text documents. They find these topics in many different ways, but the best non-mathematical, intuitive mental model I have is this: topic models identify topics by tracking which words often occur together in the same document. The more documents two words share, the more likely they are to end up in a topic together. So, for example, if a topic contained the words “zebra”, “rhino”, “lion”, “elephant”, and “cheetah”, we could infer that the topic is likely “African wildlife”.
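To make that mental model concrete, here is a minimal sketch that counts how many documents each pair of words shares. It is not how LSA or LDA actually work internally; the toy corpus and the function name count_shared_documents are made up for illustration.

```python
from collections import Counter
from itertools import combinations

# Toy corpus (hypothetical data): each document is already tokenized.
documents = [
    ["zebra", "lion", "elephant", "safari"],
    ["lion", "cheetah", "zebra", "rhino"],
    ["flour", "sugar", "oven", "recipe"],
    ["recipe", "flour", "butter", "oven"],
    ["elephant", "rhino", "lion", "safari"],
]


def count_shared_documents(docs):
    """Count, for every pair of words, how many documents contain both."""
    pair_counts = Counter()
    for doc in docs:
        # A sorted set gives each pair a stable order and counts it once per document.
        for pair in combinations(sorted(set(doc)), 2):
            pair_counts[pair] += 1
    return pair_counts


pair_counts = count_shared_documents(documents)
for (first, second), shared in pair_counts.most_common(5):
    print(f"{first} + {second}: {shared} shared documents")
```

Words such as “lion” and “zebra” share documents, while “flour” and “zebra” never do, which is the kind of signal a topic model exploits in a far more principled way.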

There are plenty of excellent guides to getting started with preprocessing text data and with the most common methods of topic modelling, like LSA and LDA. My goal is to cover some other types of topic models, some of which are extensions to LDA and some of which are entirely new.

Model Configuration

Before considering specific models, I ought to quickly cover what some of the parameters to topic models are, what they mean, and when we might want to adjust them. The models I cover allow the developer to modify the alpha and eta parameters.

eta lets the developer set the default prior probability of each word belonging to each topic. It sets a symmetric prior of a word for every topic of the model, using a number between 0 and infinity. By default, each word is assigned priors making it equally likely to be a part of each topic, mathematically determined as 1 / topic_count.

Some words are more likely to be a part of certain topics than others. For example, “Nike” has more to do with athletics and Greek mythology than nutrition, and “flour” has more to do with nutrition than Greek mythology. Encoding this knowledge is called an asymmetric prior. In tomotopy we would set it with the set_word_prior(word, prior) method, which is available in most models, to inform the model that some words are known to be more or less associated with certain topics than others.
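As a rough sketch of the difference between the two kinds of prior, the snippet below builds one weight per topic for a word. The helper names, the topic numbering, and the 1.0 / 0.1 weights are my own illustrative assumptions. The last helper assumes the model object exposes tomotopy's set_word_prior and that it accepts one weight per topic, so check the documentation of the version you have installed.

```python
def symmetric_prior(topic_count):
    """Same weight for every topic: 1 / topic_count, as described above."""
    return [1.0 / topic_count] * topic_count


def asymmetric_prior(topic_count, favoured_topics, high=1.0, low=0.1):
    """Higher weight for the favoured topics, lower weight everywhere else.

    The high/low values are arbitrary illustration values, not recommendations.
    """
    return [high if k in favoured_topics else low for k in range(topic_count)]


def apply_word_prior(model, word, prior):
    """Hand the prior to a model (assumes a tomotopy-style set_word_prior method)."""
    model.set_word_prior(word, prior)


topic_count = 4
# Hypothetical numbering: topic 0 = athletics, topic 1 = Greek mythology.
nike_prior = asymmetric_prior(topic_count, favoured_topics={0, 1})
print(symmetric_prior(topic_count))
print(nike_prior)
```

With a real model, something like apply_word_prior(mdl, 'nike', nike_prior) would be called after the model is created with the same number of topics.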

tomotopy defines the alpha parameter as the “hyperparameter of Dirichlet distribution for document-topic”. For a detailed explanation of how alpha relates to symmetric and asymmetric priors, please read this answer on Stack Overflow. But to keep things simple, alpha is a value that indicates to the model how similar the documents are in terms of the topics they contain. A low alpha value indicates that the documents have very different topic distributions, each dominated by a few topics, whereas a high value indicates that the documents contain similar topic distributions, each mixing many topics.
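One way to build an intuition for alpha is to sample document-topic mixtures from a symmetric Dirichlet distribution and look at how dominant the largest topic is. This is only a sketch using the standard library (a Dirichlet draw is a set of normalized gamma draws); the function names and the alpha values 0.1 and 10 are my own choices, not tomotopy defaults.

```python
import random


def sample_topic_mixture(alpha, topic_count, rng):
    """Draw one document-topic mixture from a symmetric Dirichlet(alpha)."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(topic_count)]
    total = sum(draws)
    return [draw / total for draw in draws]


def average_dominant_share(alpha, topic_count=5, samples=2000, seed=0):
    """Average share of the largest topic across many sampled documents."""
    rng = random.Random(seed)
    dominant_shares = (
        max(sample_topic_mixture(alpha, topic_count, rng)) for _ in range(samples)
    )
    return sum(dominant_shares) / samples


low_alpha_share = average_dominant_share(alpha=0.1)
high_alpha_share = average_dominant_share(alpha=10.0)
print(f"alpha=0.1  -> largest topic covers {low_alpha_share:.2f} of a document on average")
print(f"alpha=10.0 -> largest topic covers {high_alpha_share:.2f} of a document on average")
```

With a low alpha, one topic tends to take over each sampled document, so documents differ sharply from one another. With a high alpha, every document is close to an even blend of all topics, so they look alike.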

LDA Extensions

LDA was a major breakthrough in the area of topic modelling because it uses a probabilistic approach to determining topics. This led to many researchers extending LDA to refine results, and to find other types of information about documents. These are the models that I will start with:

I will not spend much time (if any) explaining the math behind these models. I will focus on providing an intuition about how these models work, and how to actually use them.

I ran all the code in this post on a Lenovo T410 with 4GB of memory, and each model took between 5 and 20 minutes to train, so you should be able to run them on most computers. This is because the tomotopy topic modelling library, which already implements these models, is written to be efficient in both memory and CPU usage.

Hierarchical LDA

Hierarchical LDA (HLDA) learns a hierarchy of topics, where each parent topic in the hierarchy is a more general topic than its children, establishing a relationship between the discovered topics. For example, if two topics share a parent topic then the topics are relatively similar, whereas if they do not, then the topics may differ more significantly.

HLDA is only implemented in the tomotopy library. HLDA should not be confused with the Hierarchical Dirichlet Process (HDP) which, while it uses a hierarchy within the model, does not produce topics with a hierarchical structure.
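The "shared parent" idea can be sketched with a plain child-to-parent mapping. The topic ids below are a hypothetical toy hierarchy, and the helper names are my own. With a trained tomotopy model, a similar mapping could presumably be built from mdl.parent_topic(k) for each live topic.

```python
# Hypothetical toy hierarchy: child topic id -> parent topic id (topic 0 is the root).
parent_of = {1: 0, 2: 0, 3: 1, 4: 1, 5: 2}


def ancestors(topic, parent_of):
    """Return the chain of ancestors, from the direct parent up to the root."""
    chain = []
    while topic in parent_of:
        topic = parent_of[topic]
        chain.append(topic)
    return chain


def share_parent(first, second, parent_of):
    """True when both topics hang off the same direct parent."""
    return first in parent_of and parent_of[first] == parent_of.get(second)


print(ancestors(3, parent_of))        # [1, 0]
print(share_parent(3, 4, parent_of))  # True: both are children of topic 1
print(share_parent(3, 5, parent_of))  # False: their parents are topics 1 and 2
```

Topics 3 and 4 would be expected to be closely related, while topics 3 and 5 only meet at the root and may differ more.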

import tomotopy as tp

file_name = 'running_articles.clean.txt'

# Three-level hierarchy; ignore words that occur fewer than 100 times in the corpus
mdl = tp.HLDAModel(depth=3, min_cf=100)
with open(file_name, 'r') as corpus_file:
    for line in corpus_file:
        # Each line is one preprocessed document with whitespace-separated tokens
        document = line.strip().split()
        mdl.add_doc(document)

print('Training model by iterating over the corpus 100 times, 10 iterations at a time')
iterations = 10
for i in range(0, 100, iterations):
    mdl.train(iterations)
    # Report the number of iterations completed so far
    print('Iteration: #{}\tLog-likelihood: {}'.format(i + iterations, mdl.ll_per_word))

for k in range(mdl.k):
    # Skip topic ids that are no longer in use
    if not mdl.is_live_topic(k):
        continue
    print('Topic #%s is a child of topic #%s - Level: %r, number of documents: %s' % (k, mdl.parent_topic(k), mdl.level(k), mdl.num_docs_of_topic(k)))
    print('Top 10 words of global topic #{}'.format(k))
    print(mdl.get_topic_words(k, top_n=10))

The above model produced 1888 topics, so I will only include topics that appeared in over 100 documents:

# Live topics that have children and appear in more than 100 documents
parent_topics = [k for k in range(mdl.k) if mdl.is_live_topic(k) and mdl.children_topics(k) and mdl.num_docs_of_topic(k) > 100]
for parent_topic in parent_topics:
    # Keep only the children that also appear in more than 100 documents
    child_topics = [child_topic for child_topic in mdl.children_topics(parent_topic) if mdl.num_docs_of_topic(child_topic) > 100]
    if child_topics:
        print('\n\n')
    print('Top 10 words of level %s parent topic #%s of %s documents: %r' % (mdl.level(parent_topic), parent_topic, mdl.num_docs_of_topic(parent_topic), mdl.get_topic_words(parent_topic, top_n=10)))

    for child_topic in child_topics:
        print('    Top 10 words of child topic #%s of %s documents: %r' % (child_topic, mdl.num_docs_of_topic(child_topic), mdl.get_topic_words(child_topic, top_n=10)))

Top 10 words of level 0 parent topic #0 of 79342 documents: [('run', 0.016999663785099983), ('time', 0.01686132699251175), ('race', 0.012128877453505993), ('athlete', 0.011808005161583424), ('win', 0.00961347110569477), ('training', 0.007174599915742874), ('coach', 0.007075145840644836), ('world', 0.006760003976523876), ('track', 0.006721531972289085), ('start', 0.006342952139675617)]
    Top 10 words of child topic #579 of 161 documents: [('que', 0.09543485939502716), ('del', 0.0463360920548439), ('para', 0.04391573369503021), ('bolt', 0.036308880895376205), ('con', 0.03146816045045853), ('iaaf', 0.030430860817432404), ('de_la', 0.02973932959139347), ('en_el', 0.025590138509869576), ('como', 0.02386130765080452), ('una', 0.02386130765080452)]
    Top 10 words of child topic #544 of 262 documents: [('und', 0.07650472968816757), ('der', 0.06943455338478088), ('die', 0.055475492030382156), ('das', 0.029732801020145416), ('ein', 0.02882636897265911), ('nicht', 0.02864508144557476), ('ich', 0.025744497776031494), ('ist', 0.024294205009937286), ('den', 0.022662628442049026), ('hatte', 0.02121233567595482)]
    Top 10 words of child topic #512 of 4131 documents: [('iaaf', 0.03794722259044647), ('athlete', 0.02330145053565502), ('sport', 0.016651879996061325), ('des', 0.008209897205233574), ('programme', 0.008107739500701427), ('coach', 0.00799629371613264), ('competition', 0.006696098484098911), ('include', 0.006306040100753307), ('les', 0.005767387803643942), ('athletic', 0.005674516782164574)]
    Top 10 words of child topic #437 of 125 documents: [('athlete', 0.04406151548027992), ('coach', 0.040754735469818115), ('athletic', 0.025918906554579735), ('time', 0.024846436455845833), ('compete', 0.02323773317039013), ('sport', 0.020735302940011024), ('jump', 0.01796475611627102), ('help', 0.016892287880182266), ('answ', 0.016624169424176216), ('level', 0.011887429282069206)]
    Top 10 words of child topic #434 of 1296 documents: [('time', 0.03169409558176994), ('run', 0.025730175897479057), ('race', 0.022309115156531334), ('nsw', 0.020539602264761925), ('australian', 0.01690882071852684), ('clock', 0.01683017611503601), ('athlete', 0.016305875033140182), ('compete', 0.014837834052741528), ('win', 0.014392178505659103), ('event', 0.012740632519125938)]
    Top 10 words of child topic #415 of 145 documents: [('record', 0.13075846433639526), ('set', 0.0553889200091362), ('performance', 0.03385476768016815), ('mark', 0.03231661394238472), ('usa', 0.029240306466817856), ('previous_mark', 0.027702152729034424), ('list', 0.02308768965303898), ('age', 0.021549535915255547), ('russian', 0.018473228439688683), ('note', 0.01693507470190525)]
    Top 10 words of child topic #245 of 171 documents: [('race', 0.09784720838069916), ('event', 0.040654584765434265), ('marathon', 0.027900831773877144), ('runner', 0.022719619795680046), ('half_marathon', 0.016741296276450157), ('run', 0.013353580608963966), ('write', 0.01255646999925375), ('cancel', 0.011958638206124306), ('lottery', 0.011560083366930485), ('receive', 0.011161528527736664)]
    Top 10 words of child topic #217 of 412 documents: [('run', 0.050233639776706696), ('exercise', 0.03616838529706001), ('runner', 0.02813109941780567), ('muscle', 0.016311557963490486), ('workout', 0.014834114350378513), ('help', 0.014302235096693039), ('time', 0.013061183504760265), ('improve', 0.012765695340931416), ('leg', 0.012174718081951141), ('jump', 0.010992763563990593)]
    Top 10 words of child topic #196 of 1734 documents: [('study', 0.017713475972414017), ('exercise', 0.01592574268579483), ('body', 0.009414955973625183), ('people', 0.008743579499423504), ('result', 0.00800975039601326), ('researcher', 0.007611608598381281), ('athlete', 0.006963652558624744), ('help', 0.006659191567450762), ('eat', 0.006354730110615492), ('time', 0.006229822989553213)]
    Top 10 words of child topic #193 of 196 documents: [('run', 0.03880670294165611), ('runner', 0.032584354281425476), ('injury', 0.030242612585425377), ('pain', 0.01799863949418068), ('knee', 0.01351587288081646), ('help', 0.013315152376890182), ('treatment', 0.012311547994613647), ('time', 0.011843198910355568), ('muscle', 0.010438153520226479), ('leg', 0.009300734847784042)]
    Top 10 words of child topic #167 of 3605 documents: [('shoe', 0.050075653940439224), ('run', 0.0304963868111372), ('feel', 0.02082579955458641), ('foot', 0.01544780284166336), ('runner', 0.013564104214310646), ('fit', 0.009180476889014244), ('trail', 0.009166471660137177), ('road', 0.00758388452231884), ('tester', 0.00736680394038558), ('look', 0.006904632318764925)]
    Top 10 words of child topic #160 of 160 documents: [('athlete', 0.08462043851613998), ('event', 0.08146046847105026), ('race', 0.03616759181022644), ('compete', 0.023176612332463264), ('entry', 0.023176612332463264), ('hold', 0.021772179752588272), ('competition', 0.019665535539388657), ('final', 0.01931442692875862), ('relay', 0.017207782715559006), ('list', 0.017207782715559006)]
    Top 10 words of child topic #141 of 971 documents: [('eat', 0.018273433670401573), ('food', 0.015966806560754776), ('run', 0.01318087987601757), ('help', 0.011668091639876366), ('body', 0.008163215592503548), ('don', 0.007848675362765789), ('runner', 0.007818719372153282), ('diet', 0.007609026040881872), ('time', 0.006815186236053705), ('calorie', 0.006710339803248644)]
    Top 10 words of child topic #140 of 142 documents: [('race', 0.06461523473262787), ('win', 0.03598868474364281), ('run', 0.030161138623952866), ('finish', 0.02719624526798725), ('lead', 0.023106737062335014), ('ahead', 0.021879885345697403), ('japan', 0.021573171019554138), ('start', 0.021061982959508896), ('time', 0.020448558032512665), ('marathon', 0.019835131242871284)]
    Top 10 words of child topic #95 of 407 documents: [('event', 0.02210836671292782), ('finish', 0.02103995345532894), ('win', 0.016519738361239433), ('woman', 0.013807610608637333), ('record', 0.013355589471757412), ('time', 0.013067939318716526), ('jump', 0.011876245960593224), ('school', 0.011588596738874912), ('mark', 0.011218761093914509), ('effort', 0.008794281631708145)]
    Top 10 words of child topic #66 of 336 documents: [('strictly_advance_notice_basis', 0.08189871907234192), ('competition_test_conduct', 0.07952501624822617), ('impose_follow_sanction_violation', 0.07893159240484238), ('guilty_follow_dope_violation', 0.06468937546014786), ('protect_sport_athletic_threat', 0.06112881749868393), ('doping_iaaf_spend_dollar', 0.06112881749868393), ('iaaf_inform', 0.06053538993000984), ('iaaf_rule_presence_prohibit_substance', 0.05104057490825653), ('iaaf_rule_dope_violation_relate', 0.05044714733958244), ('anabolic_agents_dope_control_sample', 0.04391946271061897)]
    Top 10 words of child topic #35 of 387 documents: [('women', 0.08514520525932312), ('race', 0.0733507052063942), ('result', 0.05639611929655075), ('mile', 0.03907295688986778), ('results', 0.03206997364759445), ('woman', 0.026909880340099335), ('heat', 0.02101263217628002), ('jump', 0.019169742241501808), ('national', 0.01806400716304779), ('live_update', 0.016958273947238922)]
    Top 10 words of child topic #33 of 181 documents: [('win', 0.036080773919820786), ('ncaa', 0.0272316075861454), ('mile', 0.025189492851495743), ('senior', 0.024849141016602516), ('finish', 0.023147378116846085), ('time', 0.02280702441930771), ('run', 0.020424555987119675), ('rank', 0.020084204152226448), ('junior', 0.01804208755493164), ('spot', 0.01702103018760681)]
    Top 10 words of child topic #15 of 986 documents: [('finish', 0.027456147596240044), ('win', 0.024502897635102272), ('woman', 0.023487716913223267), ('runner', 0.02088824287056923), ('race', 0.019580814987421036), ('lead', 0.019565433263778687), ('title', 0.015612385235726833), ('spot', 0.012136164121329784), ('time', 0.011736244894564152), ('meet', 0.011120984330773354)]
    Top 10 words of child topic #14 of 203 documents: [('usa', 0.06642155349254608), ('gbr', 0.04223703220486641), ('ger', 0.03960828110575676), ('rus', 0.033474527299404144), ('jump', 0.032948777079582214), ('jam', 0.030845774337649345), ('aus', 0.026639770716428757), ('ken', 0.0241862703114748), ('pol', 0.023310018703341484), ('esp', 0.021908018738031387)]
    Top 10 words of child topic #11 of 9633 documents: [('race', 0.04882875457406044), ('win', 0.027336107566952705), ('lead', 0.021664613857865334), ('run', 0.020966583862900734), ('time', 0.019519198685884476), ('finish', 0.01815136894583702), ('woman', 0.012608190067112446), ('pace', 0.011445662006735802), ('runner', 0.010747632011771202), ('fast', 0.009395199827849865)]
    Top 10 words of child topic #10 of 15794 documents: [('run', 0.043738700449466705), ('time', 0.015453550033271313), ('runner', 0.01428512018173933), ('training', 0.010350468568503857), ('race', 0.008522744290530682), ('feel', 0.007481011562049389), ('workout', 0.007412970531731844), ('help', 0.006702058482915163), ('week', 0.006053321994841099), ('start', 0.005782330874353647)]
    Top 10 words of child topic #9 of 21034 documents: [('win', 0.033552832901477814), ('jump', 0.02770446054637432), ('competition', 0.018576201051473618), ('event', 0.017395440489053726), ('record', 0.012598484754562378), ('woman', 0.01154245249927044), ('throw', 0.011418648064136505), ('world', 0.011275442317128181), ('lead', 0.010591746307909489), ('time', 0.010560333728790283)]
    Top 10 words of child topic #8 of 13649 documents: [('run', 0.02928888611495495), ('win', 0.026827260851860046), ('time', 0.023596594110131264), ('race', 0.021910877898335457), ('finish', 0.013195629231631756), ('meet', 0.011933918111026287), ('record', 0.01098291389644146), ('mile', 0.01088678278028965), ('event', 0.010455912910401821), ('season', 0.009935778565704823)]



Top 10 words of level 1 parent topic #8 of 13649 documents: [('run', 0.02928888611495495), ('win', 0.026827260851860046), ('time', 0.023596594110131264), ('race', 0.021910877898335457), ('finish', 0.013195629231631756), ('meet', 0.011933918111026287), ('record', 0.01098291389644146), ('mile', 0.01088678278028965), ('event', 0.010455912910401821), ('season', 0.009935778565704823)]
    Top 10 words of child topic #627 of 178 documents: [('race', 0.04199102520942688), ('marathon', 0.0393008291721344), ('win', 0.036552149802446365), ('record', 0.031171759590506554), ('run', 0.03099631331861019), ('finish', 0.029241837561130524), ('runner', 0.02444627322256565), ('woman', 0.022282419726252556), ('junior', 0.02041097916662693), ('japanese', 0.019182847812771797)]
    Top 10 words of child topic #184 of 189 documents: [('run', 0.023872308433055878), ('leg', 0.021950924769043922), ('foot', 0.016012106090784073), ('exercise', 0.013159142807126045), ('time', 0.013042694889008999), ('body', 0.012868023477494717), ('start', 0.012344010174274445), ('muscle', 0.011994668282568455), ('repeat', 0.011179535649716854), ('runner', 0.010830193758010864)]
    Top 10 words of child topic #128 of 160 documents: [('view_athlete_profile', 0.03221642225980759), ('final', 0.029352843761444092), ('iaaf', 0.0241029541939497), ('que', 0.01956895925104618), ('heat', 0.01730196177959442), ('gallery', 0.016228120774030685), ('window_location_href_indexof', 0.016108805313706398), ('result_rank_score', 0.016108805313706398), ('iaaf_iaaf_sectionselect', 0.016108805313706398), ('home_iaaf_navselect', 0.016108805313706398)]
    Top 10 words of child topic #77 of 552 documents: [('finish', 0.03318801149725914), ('season', 0.02703164331614971), ('runner', 0.022526409476995468), ('run', 0.019601544365286827), ('race', 0.01677103154361248), ('cross_country', 0.016464391723275185), ('national', 0.014388681389391422), ('look', 0.01415280532091856), ('track', 0.013256476260721684), ('return', 0.011770456098020077)]
    Top 10 words of child topic #16 of 9824 documents: [('race', 0.04116473346948624), ('win', 0.029112061485648155), ('run', 0.025310928001999855), ('time', 0.020452061668038368), ('finish', 0.018457001075148582), ('woman', 0.015471267513930798), ('marathon', 0.013898384757339954), ('lead', 0.011606121435761452), ('world', 0.010540767572820187), ('fast', 0.010223752819001675)]



Top 10 words of level 1 parent topic #9 of 21034 documents: [('win', 0.033552832901477814), ('jump', 0.02770446054637432), ('competition', 0.018576201051473618), ('event', 0.017395440489053726), ('record', 0.012598484754562378), ('woman', 0.01154245249927044), ('throw', 0.011418648064136505), ('world', 0.011275442317128181), ('lead', 0.010591746307909489), ('time', 0.010560333728790283)]
    Top 10 words of child topic #1047 of 119 documents: [('world', 0.07614222168922424), ('champion', 0.06514512002468109), ('world_champion', 0.06514512002468109), ('olympic', 0.0338456854224205), ('world_record_holder', 0.02792416885495186), ('follow_page', 0.026232309639453888), ('olympic_champion', 0.02284858748316765), ('usa', 0.02115672640502453), ('vote', 0.01946486346423626), ('finalist', 0.01946486346423626)]
    Top 10 words of child topic #1063 of 354 documents: [('meeting', 0.06618096679449081), ('athlete', 0.05612284690141678), ('win', 0.04533587768673897), ('event', 0.0437324084341526), ('iaaf', 0.023178860545158386), ('jump', 0.022304242476820946), ('season', 0.01924307458102703), ('world_record', 0.01807691715657711), ('world', 0.01691075973212719), ('golden_league', 0.01661921851336956)]
    Top 10 words of child topic #17 of 19287 documents: [('win', 0.03213324770331383), ('time', 0.021331870928406715), ('race', 0.019910166040062904), ('run', 0.017953863367438316), ('world', 0.017116088420152664), ('final', 0.013780237175524235), ('fast', 0.012455405667424202), ('lead', 0.011456174775958061), ('woman', 0.011364683508872986), ('finish', 0.011046257801353931)]



Top 10 words of level 1 parent topic #10 of 15794 documents: [('run', 0.043738700449466705), ('time', 0.015453550033271313), ('runner', 0.01428512018173933), ('training', 0.010350468568503857), ('race', 0.008522744290530682), ('feel', 0.007481011562049389), ('workout', 0.007412970531731844), ('help', 0.006702058482915163), ('week', 0.006053321994841099), ('start', 0.005782330874353647)]
    Top 10 words of child topic #714 of 416 documents: [('race', 0.08588524907827377), ('event', 0.0463140495121479), ('runner', 0.03856364265084267), ('marathon', 0.03522403538227081), ('run', 0.021172480657696724), ('finisher', 0.02079441212117672), ('half_marathon', 0.01827395334839821), ('woman', 0.013800138607621193), ('road', 0.012602920643985271), ('participant', 0.011909794993698597)]
    Top 10 words of child topic #422 of 205 documents: [('race', 0.051014937460422516), ('marathon', 0.03314032778143883), ('runner', 0.02470511943101883), ('event', 0.022295059636235237), ('include', 0.019885001704096794), ('official', 0.016069073230028152), ('photo', 0.016069073230028152), ('feature', 0.014864043332636356), ('run', 0.014462366700172424), ('provide', 0.011650629341602325)]
    Top 10 words of child topic #281 of 460 documents: [('run', 0.06980745494365692), ('pace', 0.04125266894698143), ('race', 0.040672775357961655), ('workout', 0.030814575031399727), ('mile', 0.025555534288287163), ('marathon', 0.01939665898680687), ('training', 0.017976917326450348), ('minute', 0.016557177528738976), ('fast', 0.016397206112742424), ('time', 0.015797315165400505)]
    Top 10 words of child topic #253 of 193 documents: [('shoe', 0.05986529216170311), ('gear', 0.04217037931084633), ('run', 0.040287941694259644), ('sale', 0.03313467651605606), ('look', 0.02824033610522747), ('deal', 0.025981411337852478), ('include', 0.02071058377623558), ('check', 0.01958112046122551), ('offer', 0.013557318598031998), ('running_shoe', 0.013557318598031998)]
    Top 10 words of child topic #82 of 182 documents: [('race', 0.06477952748537064), ('runner', 0.053903233259916306), ('marathon', 0.0361492782831192), ('run', 0.027992060407996178), ('event', 0.02671249583363533), ('charity', 0.02383347786962986), ('boston_marathon', 0.01583620347082615), ('raise', 0.014236749149858952), ('program', 0.012477348558604717), ('community', 0.012157456949353218)]
    Top 10 words of child topic #73 of 181 documents: [('athlete', 0.05292687192559242), ('iaaf', 0.027750289067626), ('sport', 0.020399462431669235), ('russian', 0.017642902210354805), ('report', 0.015437654219567776), ('russia', 0.014243144541978836), ('ban', 0.012772979214787483), ('wada', 0.011027159169316292), ('track_field', 0.010016419924795628), ('compete', 0.010016419924795628)]
    Top 10 words of child topic #18 of 11844 documents: [('run', 0.05022493749856949), ('race', 0.03767874464392662), ('runner', 0.01810699887573719), ('marathon', 0.016719095408916473), ('mile', 0.015035257674753666), ('time', 0.012781347148120403), ('finish', 0.009854130446910858), ('start', 0.0087151313200593), ('people', 0.008232233114540577), ('course', 0.007440783549100161)]



Top 10 words of level 1 parent topic #11 of 9633 documents: [('race', 0.04882875457406044), ('win', 0.027336107566952705), ('lead', 0.021664613857865334), ('run', 0.020966583862900734), ('time', 0.019519198685884476), ('finish', 0.01815136894583702), ('woman', 0.012608190067112446), ('pace', 0.011445662006735802), ('runner', 0.010747632011771202), ('fast', 0.009395199827849865)]
    Top 10 words of child topic #1824 of 323 documents: [('marathon', 0.0742373988032341), ('race', 0.04619068652391434), ('run', 0.04347999021410942), ('runner', 0.04145599901676178), ('win', 0.03574545681476593), ('record', 0.026131508871912956), ('finish', 0.023999091237783432), ('fast', 0.02370995096862316), ('japanese', 0.02226424403488636), ('personal', 0.02049325406551361)]
    Top 10 words of child topic #19 of 9070 documents: [('race', 0.044566452503204346), ('win', 0.033450666815042496), ('marathon', 0.019214965403079987), ('finish', 0.01887447200715542), ('time', 0.018656402826309204), ('woman', 0.018396249040961266), ('run', 0.015723947435617447), ('runner', 0.014907144010066986), ('world', 0.013095640577375889), ('event', 0.011232488788664341)]



Top 10 words of level 1 parent topic #14 of 203 documents: [('usa', 0.06642155349254608), ('gbr', 0.04223703220486641), ('ger', 0.03960828110575676), ('rus', 0.033474527299404144), ('jump', 0.032948777079582214), ('jam', 0.030845774337649345), ('aus', 0.026639770716428757), ('ken', 0.0241862703114748), ('pol', 0.023310018703341484), ('esp', 0.021908018738031387)]
    Top 10 words of child topic #53 of 115 documents: [('mark', 0.06642331182956696), ('country', 0.054394837468862534), ('final', 0.036613620817661285), ('ger', 0.03347575664520264), ('time', 0.020924309268593788), ('iaaf', 0.020401332527399063), ('xxx', 0.01987835392355919), ('russia', 0.01883240044116974), ('athlete', 0.014125608839094639), ('result', 0.013602632097899914)]



Top 10 words of level 1 parent topic #15 of 986 documents: [('finish', 0.027456147596240044), ('win', 0.024502897635102272), ('woman', 0.023487716913223267), ('runner', 0.02088824287056923), ('race', 0.019580814987421036), ('lead', 0.019565433263778687), ('title', 0.015612385235726833), ('spot', 0.012136164121329784), ('time', 0.011736244894564152), ('meet', 0.011120984330773354)]
    Top 10 words of child topic #156 of 168 documents: [('week', 0.03476425260305405), ('woman', 0.025404971092939377), ('ranking', 0.016774985939264297), ('style_link_balloon_text', 0.014951751567423344), ('font_family_tahoma_san', 0.014587104320526123), ('mso_style_priority_mso', 0.014465555548667908), ('font_family_calibri_san', 0.013979358598589897), ('east', 0.009968239814043045), ('bottom_pt_font_size', 0.009968239814043045), ('final', 0.009603592567145824)]
    Top 10 words of child topic #30 of 693 documents: [('win', 0.03422453626990318), ('time', 0.027943121269345284), ('event', 0.01874011568725109), ('meter', 0.013856887817382812), ('season', 0.013356043957173824), ('jump', 0.013293438591063023), ('meet', 0.01314735971391201), ('title', 0.012229145504534245), ('mark', 0.01072661392390728), ('record', 0.010622271336615086)]



Top 10 words of level 1 parent topic #33 of 181 documents: [('win', 0.036080773919820786), ('ncaa', 0.0272316075861454), ('mile', 0.025189492851495743), ('senior', 0.024849141016602516), ('finish', 0.023147378116846085), ('time', 0.02280702441930771), ('run', 0.020424555987119675), ('rank', 0.020084204152226448), ('junior', 0.01804208755493164), ('spot', 0.01702103018760681)]
    Top 10 words of child topic #47 of 169 documents: [('amp_amp_amp_gt', 0.040227361023426056), ('amp_amp_amp_lt', 0.038723673671483994), ('amp_amp_gt_amp', 0.034212615340948105), ('amp_amp_lt_span', 0.03120524436235428), ('class_redactor_selection_marker', 0.024438656866550446), ('run', 0.024062735959887505), ('redactor_selection_marker_data', 0.022934969514608383), ('span_id_selection_marker', 0.02180720493197441), ('lt_span_amp_amp', 0.02143128402531147), ('verified_redactor_amp_amp', 0.02105536311864853)]



Top 10 words of level 1 parent topic #35 of 387 documents: [('women', 0.08514520525932312), ('race', 0.0733507052063942), ('result', 0.05639611929655075), ('mile', 0.03907295688986778), ('results', 0.03206997364759445), ('woman', 0.026909880340099335), ('heat', 0.02101263217628002), ('jump', 0.019169742241501808), ('national', 0.01806400716304779), ('live_update', 0.016958273947238922)]
    Top 10 words of child topic #51 of 344 documents: [('live', 0.06258029490709305), ('watch_live', 0.05057558789849281), ('event', 0.04240216687321663), ('start', 0.029631201177835464), ('schedule', 0.028098685666918755), ('watch', 0.02784326672554016), ('meet', 0.023245716467499733), ('list', 0.02222403883934021), ('pm_et', 0.020946942269802094), ('link', 0.01915900781750679)]



Top 10 words of level 1 parent topic #66 of 336 documents: [('strictly_advance_notice_basis', 0.08189871907234192), ('competition_test_conduct', 0.07952501624822617), ('impose_follow_sanction_violation', 0.07893159240484238), ('guilty_follow_dope_violation', 0.06468937546014786), ('protect_sport_athletic_threat', 0.06112881749868393), ('doping_iaaf_spend_dollar', 0.06112881749868393), ('iaaf_inform', 0.06053538993000984), ('iaaf_rule_presence_prohibit_substance', 0.05104057490825653), ('iaaf_rule_dope_violation_relate', 0.05044714733958244), ('anabolic_agents_dope_control_sample', 0.04391946271061897)]
    Top 10 words of child topic #572 of 336 documents: [('iaaf', 0.06441847234964371), ('iaaf_commit_protect_sport', 0.06303616613149643), ('athletic_threat_doping_iaaf', 0.06303616613149643), ('spend_dollar_ongoing_battle', 0.06303616613149643), ('add_list_anti', 0.06248323991894722), ('website_anti_dope_statistic', 0.05861276760697365), ('section_note_iaaf_rule_athlete', 0.05778338015079498), ('sanction_dope_violation_appeal', 0.05778338015079498), ('decision_relevant_appeal_body', 0.05778338015079498), ('competition_test_conduct', 0.05446583405137062)]



Top 10 words of level 1 parent topic #95 of 407 documents: [('event', 0.02210836671292782), ('finish', 0.02103995345532894), ('win', 0.016519738361239433), ('woman', 0.013807610608637333), ('record', 0.013355589471757412), ('time', 0.013067939318716526), ('jump', 0.011876245960593224), ('school', 0.011588596738874912), ('mark', 0.011218761093914509), ('effort', 0.008794281631708145)]
    Top 10 words of child topic #132 of 399 documents: [('time', 0.035251084715127945), ('arkansas', 0.021683383733034134), ('final', 0.021218447014689445), ('event', 0.02062670886516571), ('finish', 0.017921622842550278), ('meet', 0.015554671175777912), ('razorbacks', 0.014328929595649242), ('meter', 0.013948526233434677), ('run', 0.01306091994047165), ('win', 0.012680516578257084)]



Top 10 words of level 1 parent topic #140 of 142 documents: [('race', 0.06461523473262787), ('win', 0.03598868474364281), ('run', 0.030161138623952866), ('finish', 0.02719624526798725), ('lead', 0.023106737062335014), ('ahead', 0.021879885345697403), ('japan', 0.021573171019554138), ('start', 0.021061982959508896), ('time', 0.020448558032512665), ('marathon', 0.019835131242871284)]
    Top 10 words of child topic #527 of 132 documents: [('stage', 0.1435365378856659), ('lead', 0.053494635969400406), ('record', 0.03942904993891716), ('runner', 0.03787851333618164), ('run', 0.03710324317216873), ('km_stage', 0.025363462045788765), ('stage_km', 0.02514195628464222), ('win', 0.022373141720891), ('start', 0.017832282930612564), ('move', 0.0172785185277462)]



Top 10 words of level 1 parent topic #141 of 971 documents: [('eat', 0.018273433670401573), ('food', 0.015966806560754776), ('run', 0.01318087987601757), ('help', 0.011668091639876366), ('body', 0.008163215592503548), ('don', 0.007848675362765789), ('runner', 0.007818719372153282), ('diet', 0.007609026040881872), ('time', 0.006815186236053705), ('calorie', 0.006710339803248644)]
    Top 10 words of child topic #183 of 890 documents: [('add', 0.02718305215239525), ('minute', 0.016030484810471535), ('cup', 0.01590346172451973), ('recipe', 0.012067384086549282), ('serve', 0.01166091300547123), ('cook', 0.01120363175868988), ('heat', 0.010873373597860336), ('taste', 0.007037296425551176), ('tablespoon', 0.006681633647531271), ('cut', 0.0065292068757116795)]
Top 10 words of level 1 parent topic #160 of 160 documents: [('athlete', 0.08462043851613998), ('event', 0.08146046847105026), ('race', 0.03616759181022644), ('compete', 0.023176612332463264), ('entry', 0.023176612332463264), ('hold', 0.021772179752588272), ('competition', 0.019665535539388657), ('final', 0.01931442692875862), ('relay', 0.017207782715559006), ('list', 0.017207782715559006)]



Top 10 words of level 1 parent topic #167 of 3605 documents: [('shoe', 0.050075653940439224), ('run', 0.0304963868111372), ('feel', 0.02082579955458641), ('foot', 0.01544780284166336), ('runner', 0.013564104214310646), ('fit', 0.009180476889014244), ('trail', 0.009166471660137177), ('road', 0.00758388452231884), ('tester', 0.00736680394038558), ('look', 0.006904632318764925)]
    Top 10 words of child topic #376 of 281 documents: [('shoe', 0.08123554289340973), ('adidas', 0.030036069452762604), ('design', 0.02431545965373516), ('nike', 0.024029428139328957), ('sneaker', 0.020024999976158142), ('release', 0.019166909158229828), ('brand', 0.01859484799206257), ('style', 0.01802278496325016), ('available', 0.017164694145321846), ('color', 0.01401835773140192)]
    Top 10 words of child topic #180 of 3196 documents: [('shoe', 0.08854697644710541), ('battery_mechanical_test_lab', 0.05375658720731735), ('shoe_real_world_usage', 0.05375658720731735), ('provide_objective_exclusive_datum', 0.05375658720731735), ('addition_shoe_weight_measure', 0.05208312347531319), ('sole_thickness_sit_foot', 0.05208312347531319), ('road_foam_cushion_stride', 0.05208312347531319), ('flexibility_forefoot_account_review', 0.05208312347531319), ('review_shoe', 0.018290938809514046), ('datum_addition_shoe_weight', 0.017967989668250084)]



Top 10 words of level 1 parent topic #193 of 196 documents: [('run', 0.03880670294165611), ('runner', 0.032584354281425476), ('injury', 0.030242612585425377), ('pain', 0.01799863949418068), ('knee', 0.01351587288081646), ('help', 0.013315152376890182), ('treatment', 0.012311547994613647), ('time', 0.011843198910355568), ('muscle', 0.010438153520226479), ('leg', 0.009300734847784042)]
    Top 10 words of child topic #188 of 176 documents: [('foot', 0.032363589853048325), ('pain', 0.02274235151708126), ('stretch', 0.015745090320706367), ('run', 0.015745090320706367), ('tendon', 0.014745481312274933), ('joint', 0.013371018692851067), ('plantar_fasciitis', 0.011496752500534058), ('ankle', 0.01099694799631834), ('heel', 0.010871997103095055), ('week', 0.009497534483671188)]



Top 10 words of level 1 parent topic #196 of 1734 documents: [('study', 0.017713475972414017), ('exercise', 0.01592574268579483), ('body', 0.009414955973625183), ('people', 0.008743579499423504), ('result', 0.00800975039601326), ('researcher', 0.007611608598381281), ('athlete', 0.006963652558624744), ('help', 0.006659191567450762), ('eat', 0.006354730110615492), ('time', 0.006229822989553213)]
    Top 10 words of child topic #205 of 1576 documents: [('run', 0.055178452283144), ('runner', 0.030901353806257248), ('training', 0.02480400912463665), ('time', 0.01632300205528736), ('workout', 0.01595163531601429), ('race', 0.015122534707188606), ('week', 0.010122021660208702), ('pace', 0.00959519762545824), ('minute', 0.008800642564892769), ('performance', 0.008368819952011108)]



Top 10 words of level 1 parent topic #217 of 412 documents: [('run', 0.050233639776706696), ('exercise', 0.03616838529706001), ('runner', 0.02813109941780567), ('muscle', 0.016311557963490486), ('workout', 0.014834114350378513), ('help', 0.014302235096693039), ('time', 0.013061183504760265), ('improve', 0.012765695340931416), ('leg', 0.012174718081951141), ('jump', 0.010992763563990593)]
    Top 10 words of child topic #215 of 406 documents: [('leg', 0.02565544657409191), ('foot', 0.0225308109074831), ('position', 0.01957063004374504), ('hold', 0.01611708663403988), ('stretch', 0.016062268987298012), ('knee', 0.015842996537685394), ('stand', 0.015075542032718658), ('start', 0.014308087527751923), ('set', 0.014143633656203747), ('straight', 0.014088815078139305)]



Top 10 words of level 1 parent topic #245 of 171 documents: [('race', 0.09784720838069916), ('event', 0.040654584765434265), ('marathon', 0.027900831773877144), ('runner', 0.022719619795680046), ('half_marathon', 0.016741296276450157), ('run', 0.013353580608963966), ('write', 0.01255646999925375), ('cancel', 0.011958638206124306), ('lottery', 0.011560083366930485), ('receive', 0.011161528527736664)]
    Top 10 words of child topic #262 of 166 documents: [('runner', 0.06970928609371185), ('registration', 0.045303262770175934), ('time', 0.04329729080200195), ('qualifier', 0.039285339415073395), ('marathon', 0.032097261399030685), ('race', 0.029088301584124565), ('run', 0.022903213277459145), ('register', 0.02039574645459652), ('boston_marathon', 0.0192255936563015), ('entry', 0.0192255936563015)]



Top 10 words of level 1 parent topic #415 of 145 documents: [('record', 0.13075846433639526), ('set', 0.0553889200091362), ('performance', 0.03385476768016815), ('mark', 0.03231661394238472), ('usa', 0.029240306466817856), ('previous_mark', 0.027702152729034424), ('list', 0.02308768965303898), ('age', 0.021549535915255547), ('russian', 0.018473228439688683), ('note', 0.01693507470190525)]
    Top 10 words of child topic #452 of 138 documents: [('previous', 0.14780761301517487), ('world_record', 0.07911153137683868), ('women', 0.04441653564572334), ('follow', 0.04372263327240944), ('iaaf', 0.03400803357362747), ('senior', 0.03262023255228996), ('mar', 0.025681234896183014), ('aug', 0.024293435737490654), ('ratify', 0.022905634716153145), ('equal', 0.01804833672940731)]



Top 10 words of level 1 parent topic #434 of 1296 documents: [('time', 0.03169409558176994), ('run', 0.025730175897479057), ('race', 0.022309115156531334), ('nsw', 0.020539602264761925), ('australian', 0.01690882071852684), ('clock', 0.01683017611503601), ('athlete', 0.016305875033140182), ('compete', 0.014837834052741528), ('win', 0.014392178505659103), ('event', 0.012740632519125938)]
    Top 10 words of child topic #443 of 955 documents: [('win', 0.06009676307439804), ('jump', 0.033705923706293106), ('event', 0.021374177187681198), ('hurdle', 0.020322749391198158), ('final', 0.020082421600818634), ('competition', 0.018535321578383446), ('record', 0.015320955775678158), ('throw', 0.01434462983161211), ('run', 0.013653691858053207), ('time', 0.013353284448385239)]
Top 10 words of level 1 parent topic #437 of 125 documents: [('athlete', 0.04406151548027992), ('coach', 0.040754735469818115), ('athletic', 0.025918906554579735), ('time', 0.024846436455845833), ('compete', 0.02323773317039013), ('sport', 0.020735302940011024), ('jump', 0.01796475611627102), ('help', 0.016892287880182266), ('answ', 0.016624169424176216), ('level', 0.011887429282069206)]



Top 10 words of level 1 parent topic #512 of 4131 documents: [('iaaf', 0.03794722259044647), ('athlete', 0.02330145053565502), ('sport', 0.016651879996061325), ('des', 0.008209897205233574), ('programme', 0.008107739500701427), ('coach', 0.00799629371613264), ('competition', 0.006696098484098911), ('include', 0.006306040100753307), ('les', 0.005767387803643942), ('athletic', 0.005674516782164574)]
    Top 10 words of child topic #510 of 3707 documents: [('event', 0.030564773827791214), ('athletic', 0.020347116515040398), ('sport', 0.019405532628297806), ('athlete', 0.016845600679516792), ('world', 0.014094410464167595), ('competition', 0.012799732387065887), ('iaaf', 0.011240233667194843), ('meeting', 0.00943798292428255), ('championship', 0.009283504448831081), ('include', 0.008194797672331333)]



Top 10 words of level 1 parent topic #544 of 262 documents: [('und', 0.07650472968816757), ('der', 0.06943455338478088), ('die', 0.055475492030382156), ('das', 0.029732801020145416), ('ein', 0.02882636897265911), ('nicht', 0.02864508144557476), ('ich', 0.025744497776031494), ('ist', 0.024294205009937286), ('den', 0.022662628442049026), ('hatte', 0.02121233567595482)]
    Top 10 words of child topic #539 of 262 documents: [('die', 0.08208174258470535), ('der', 0.0590418316423893), ('und', 0.05431042239069939), ('mit', 0.04381902888417244), ('ber', 0.02674480900168419), ('den', 0.025510530918836594), ('eine', 0.02386482246220112), ('einen', 0.023041969165205956), ('wurde', 0.022013401612639427), ('minuten', 0.021190546452999115)]



Top 10 words of level 1 parent topic #579 of 161 documents: [('que', 0.09543485939502716), ('del', 0.0463360920548439), ('para', 0.04391573369503021), ('bolt', 0.036308880895376205), ('con', 0.03146816045045853), ('iaaf', 0.030430860817432404), ('de_la', 0.02973932959139347), ('en_el', 0.025590138509869576), ('como', 0.02386130765080452), ('una', 0.02386130765080452)]
    Top 10 words of child topic #598 of 161 documents: [('que', 0.04204784333705902), ('del', 0.03889451548457146), ('en_la', 0.038193777203559875), ('final', 0.03539082035422325), ('en_el', 0.03504045307636261), ('metro', 0.03433971479535103), ('con', 0.030836017802357674), ('los', 0.027682693675160408), ('de_la', 0.025930846109986305), ('pero', 0.024529367685317993)]

Multi-grain LDA

Multi-grain LDA (MGLDA) learns local topics in addition to the global topics learned by LDA. A local topic is one that cuts across the global topics. For example, in a corpus of documents about nutrition, the global topics may be dairy products, meats, pastas, fruits, and vegetables, while the local topics may cover vitamins, protein, amino acids, and dieting: these subjects come up within most global topics, but are discussed differently in each document. By modelling local topics separately, MGLDA keeps their words out of the global topics, which produces more coherent global topics. An MGLDA model takes the number of global topics and the number of local topics as parameters, so you can adjust both counts to suit your corpus.

import tomotopy as tp
file_name = 'running_articles.txt.clean'

# k_g is the number of global topics, while k_l is the number of local topics
mdl = tp.MGLDAModel(k_g=30, k_l=30, min_cf=100)
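# Each line of the file is treated as one document of whitespace-separated tokens
with open(file_name, 'r') as input_file:
    for line in input_file:
        document = line.strip().split()
        mdl.add_doc(document)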

print('Starting training model')
iterations = 10
for i in range(0, 100, iterations):
    mdl.train(iterations)
    print('Iteration: {}\tLog-likelihood: {}'.format(i, mdl.ll_per_word))

for k in range(mdl.k_g):
    print('Top 10 words of global topic #{}'.format(k))
    print(mdl.get_topic_words(k, top_n=10))

for k in range(mdl.k_l):
    print('Top 10 words of local topic #{}'.format(k))
    print(mdl.get_topic_words(mdl.k_g + k, top_n=10))
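
Because global and local topics share one flat ID space in the code above (local topic k is read at index k_g + k), a small helper can make that mapping explicit. The following is a minimal sketch in plain Python; the function name describe_topic_id is hypothetical and is not part of tomotopy.

```python
def describe_topic_id(topic_id, k_g, k_l):
    """Map a flat topic ID to ('global', k) or ('local', k).

    Assumes the layout used in the loops above: IDs 0..k_g-1 are global
    topics and IDs k_g..k_g+k_l-1 are local topics.
    """
    if not 0 <= topic_id < k_g + k_l:
        raise ValueError('topic_id out of range: {}'.format(topic_id))
    if topic_id < k_g:
        return ('global', topic_id)
    return ('local', topic_id - k_g)


# With 30 global and 30 local topics, flat ID 33 is local topic #3
kind, index = describe_topic_id(33, k_g=30, k_l=30)
print('{} topic #{}'.format(kind, index))
```

A helper like this is mainly useful when topic IDs come back from the model as plain integers and need to be reported as either global or local.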

This model trained in roughly 10 minutes, producing the following topics:

Top 10 words of global topic #0
[('boston', 0.024035736918449402), ('boston_marathon', 0.017629483714699745), ('finisher', 0.016284659504890442), ('city', 0.014988738112151623), ('finish_line', 0.0120851406827569), ('participant', 0.011938432231545448), ('elite', 0.007824492640793324), ('york', 0.007384368684142828), ('organizer', 0.00641242740675807), ('register', 0.005648322869092226)]
Top 10 words of global topic #1
[('farah', 0.015569980256259441), ('dibaba', 0.012552689760923386), ('diamond_league', 0.009686845354735851), ('beijing', 0.008923784829676151), ('rudisha', 0.00880728755146265), ('defar', 0.008352945558726788), ('daegu', 0.008341296575963497), ('doha', 0.00800345279276371), ('brussels', 0.007799581624567509), ('monaco', 0.007403489202260971)]
Top 10 words of global topic #2
[('hall', 0.014030586928129196), ('keflezighi', 0.01341606117784977), ('boston', 0.01315367966890335), ('flanagan', 0.012987964786589146), ('trial', 0.011682961136102676), ('rupp', 0.010260575450956821), ('huddle', 0.009984384290874004), ('debut', 0.008610332384705544), ('goucher', 0.007761044427752495), ('kastor', 0.006925565656274557)]
Top 10 words of global topic #3
[('course_record', 0.032830528914928436), ('ethiopia', 0.013442019000649452), ('ethiopian', 0.012429723516106606), ('debut', 0.01139530073851347), ('berlin', 0.010582144372165203), ('kiplagat', 0.007954600267112255), ('defend_champion', 0.006997620686888695), ('london_marathon', 0.0068261390551924706), ('marathon_debut', 0.006698910612612963), ('chicago', 0.0063836053013801575)]
Top 10 words of global topic #4
[('friend', 0.010972775518894196), ('family', 0.010570644401013851), ('remember', 0.007494700141251087), ('kid', 0.006681032944470644), ('moment', 0.0064129456877708435), ('learn', 0.006396484095603228), ('fun', 0.005698047112673521), ('person', 0.005509916227310896), ('play', 0.005046642851084471), ('mind', 0.0050395880825817585)]
Top 10 words of global topic #5
[('company', 0.014105774462223053), ('brand', 0.012160157784819603), ('nike', 0.009656957350671291), ('product', 0.008541787043213844), ('store', 0.008411288261413574), ('wear', 0.008079110644757748), ('offer', 0.006987667176872492), ('design', 0.00697580398991704), ('video', 0.0067563289776444435), ('create', 0.006382628343999386)]
Top 10 words of global topic #6
[('report', 0.014753046445548534), ('rule', 0.01067736092954874), ('official', 0.009579848498106003), ('decision', 0.009155183099210262), ('test', 0.007930822670459747), ('accord', 0.007886701263487339), ('issue', 0.006750583183020353), ('receive', 0.006160463206470013), ('usatf', 0.006088766735047102), ('write', 0.00593985803425312)]
Top 10 words of global topic #7
[('heptathlon', 0.015265149995684624), ('pole_vault', 0.014611037448048592), ('height', 0.01419240515679121), ('score', 0.014057844877243042), ('decathlon', 0.012981363572180271), ('javelin', 0.011665662750601768), ('clearance', 0.008353985846042633), ('discus', 0.0077484650537371635), ('leap', 0.007707349024713039), ('jumper', 0.006589751690626144)]
Top 10 words of global topic #8
[('stay_informed_late_news', 0.02110220491886139), ('register_weekly_newsletter', 0.02108335681259632), ('british', 0.013500658795237541), ('britain', 0.012646269984543324), ('france', 0.011245324276387691), ('triumph', 0.011107114143669605), ('germany', 0.010585686191916466), ('european_champion', 0.00958680547773838), ('russia', 0.00848112627863884), ('poland', 0.007915722206234932)]
Top 10 words of global topic #9
[('city', 0.016673676669597626), ('host', 0.01540266815572977), ('stadium', 0.012492728419601917), ('fan', 0.010614088736474514), ('venue', 0.008512466214597225), ('track_field', 0.008105520159006119), ('international', 0.007341800257563591), ('berlin', 0.00705192144960165), ('edition', 0.006461014039814472), ('organiser', 0.006243604701012373)]
Top 10 words of global topic #10
[('bad', 0.006993656046688557), ('wear', 0.006270969286561012), ('hand', 0.006033897865563631), ('car', 0.005085610784590244), ('cold', 0.004990017041563988), ('hour', 0.004630585666745901), ('warm', 0.004160265903919935), ('call', 0.003999668639153242), ('remember', 0.003953783772885799), ('hear', 0.003877309150993824)]
Top 10 words of global topic #11
[('kilometre', 0.016196072101593018), ('ethiopia', 0.01294862199574709), ('ethiopian', 0.010910377837717533), ('bekele', 0.009270622394979), ('compatriot', 0.007159092463552952), ('italian', 0.00555597897619009), ('bronze_medallist', 0.005203294102102518), ('battle', 0.004905573092401028), ('podium', 0.004891831893473864), ('leader', 0.004850609228014946)]
Top 10 words of global topic #12
[('shape', 0.011439691297709942), ('prepare', 0.01069772057235241), ('preparation', 0.01064788643270731), ('tough', 0.01057036779820919), ('olympics', 0.010420866310596466), ('look_forward', 0.009800711646676064), ('bit', 0.009485096670687199), ('excited', 0.008300159126520157), ('competitive', 0.008056527003645897), ('confidence', 0.007464057765901089)]
Top 10 words of global topic #13
[('program', 0.014747320674359798), ('community', 0.012894353829324245), ('club', 0.01017786841839552), ('family', 0.009755104780197144), ('child', 0.009228898212313652), ('kid', 0.00782118272036314), ('local', 0.006818242371082306), ('grow', 0.00648542819544673), ('create', 0.005756834521889687), ('student', 0.0056039197370409966)]
Top 10 words of global topic #14
[('ncaa', 0.01912536472082138), ('oregon', 0.01314721442759037), ('freshman', 0.012218338437378407), ('rank', 0.012199285440146923), ('squad', 0.009374549612402916), ('score', 0.008155101910233498), ('girl', 0.00815033819526434), ('spot', 0.008112231269478798), ('sophomore', 0.007778788451105356), ('top', 0.007283387705683708)]
Top 10 words of global topic #15
[('hour', 0.008924037218093872), ('mountain', 0.00887240283191204), ('route', 0.007207222282886505), ('eat', 0.006862998474389315), ('food', 0.006764034274965525), ('water', 0.00634666346013546), ('ultra', 0.006286424584686756), ('favorite', 0.006023954134434462), ('hill', 0.005838933866471052), ('pack', 0.004668573848903179)]
Top 10 words of global topic #16
[('jamaica', 0.017771722748875618), ('bolt', 0.015777232125401497), ('semi_final', 0.01284845918416977), ('jamaican', 0.010788975283503532), ('sec', 0.010658987797796726), ('sprinter', 0.010638677515089512), ('johnson', 0.008652310818433762), ('hurdles', 0.008071430958807468), ('lane', 0.007506799418479204), ('jones', 0.007104651536792517)]
Top 10 words of global topic #17
[('study', 0.011447546072304249), ('exercise', 0.009012983180582523), ('recovery', 0.006715363822877407), ('increase', 0.006515019573271275), ('percent', 0.006444011349231005), ('benefit', 0.0055893780663609505), ('relate', 0.004308695904910564), ('research', 0.004285871982574463), ('faster', 0.004212327767163515), ('effect', 0.004209791775792837)]
Top 10 words of global topic #18
[('japanese', 0.025016404688358307), ('japan', 0.02378581091761589), ('china', 0.02293386310338974), ('chinese', 0.01648830994963646), ('national_record', 0.011798287741839886), ('osaka', 0.010163234546780586), ('beijing', 0.009767379611730576), ('games', 0.007581571117043495), ('india', 0.006712411064654589), ('asian', 0.006591933313757181)]
Top 10 words of global topic #19
[('fit', 0.014452415518462658), ('provide_objective_exclusive_datum', 0.011898950673639774), ('battery_mechanical_test_lab', 0.011898950673639774), ('shoe_real_world_usage', 0.011898950673639774), ('road_foam_cushion_stride', 0.011529541574418545), ('flexibility_forefoot_account_review', 0.011529541574418545), ('addition_shoe_weight_measure', 0.011529541574418545), ('sole_thickness_sit_foot', 0.011529541574418545), ('light', 0.010084305889904499), ('ride', 0.008729803375899792)]
Top 10 words of global topic #20
[('women', 0.03362782299518585), ('gbr', 0.027514899149537086), ('ken', 0.023752082139253616), ('ger', 0.02290709875524044), ('rus', 0.019659195095300674), ('mar', 0.014259223826229572), ('fra', 0.012899328954517841), ('jam', 0.012780503369867802), ('esp', 0.011737477034330368), ('brazil', 0.011288579553365707)]
Top 10 words of global topic #21
[('russian', 0.011753606610000134), ('russia', 0.008098330348730087), ('hurdles', 0.007086044177412987), ('outdoor', 0.007041871547698975), ('national_record', 0.006883586756885052), ('jumper', 0.006813646759837866), ('italian', 0.006659043021500111), ('bronze_medallist', 0.006353516597300768), ('helsinki', 0.006320387125015259), ('leap', 0.006191550754010677)]
Top 10 words of global topic #22
[('track_field', 0.017225435003638268), ('arkansas', 0.012159928679466248), ('women', 0.010906473733484745), ('panose_font_font_family', 0.009717630222439766), ('conference', 0.009161975234746933), ('rank', 0.008825997821986675), ('schedule', 0.008800153620541096), ('williams', 0.008528787642717361), ('font_family_calibri_san', 0.008464176207780838), ('pole_vault', 0.007766377180814743)]
Top 10 words of global topic #23
[('award', 0.016018403694033623), ('programme', 0.011424679309129715), ('president', 0.010079204104840755), ('des', 0.009155682288110256), ('vote', 0.007236986421048641), ('athletics', 0.007053874433040619), ('official', 0.006910569500178099), ('les', 0.006424924358725548), ('federation', 0.006416962947696447), ('development', 0.006082584615796804)]
Top 10 words of global topic #24
[('lagat', 0.014886642806231976), ('simpson', 0.009470997378230095), ('american_record', 0.008497470989823341), ('oregon', 0.008071955293416977), ('outdoor', 0.007865644991397858), ('rupp', 0.007478813640773296), ('cain', 0.007188689429312944), ('eugene', 0.007130664773285389), ('track_field', 0.0069759320467710495), ('wilson', 0.006531075574457645)]
Top 10 words of global topic #25
[('walker', 0.04585910961031914), ('australia', 0.025868376716971397), ('die', 0.02046995796263218), ('australian', 0.018738042563199997), ('mexico', 0.018424823880195618), ('der', 0.01381866354495287), ('und', 0.013100102543830872), ('berlin', 0.012307843193411827), ('world_cup', 0.011865652166306973), ('race_walking', 0.010999693535268307)]
Top 10 words of global topic #26
[('pack', 0.013749918900430202), ('leader', 0.010822350159287453), ('catch', 0.009628056548535824), ('surge', 0.009205400943756104), ('pull', 0.009023155085742474), ('split', 0.008988257497549057), ('kick', 0.00853458046913147), ('stay', 0.007204572670161724), ('straight', 0.0062623219564557076), ('hit', 0.006235179025679827)]
Top 10 words of global topic #27
[('australian', 0.03417911380529404), ('nsw', 0.027681604027748108), ('australia', 0.01706274226307869), ('girl', 0.01641164906322956), ('boy', 0.012652759440243244), ('claim', 0.01059208158403635), ('qualifier', 0.009766468778252602), ('discus', 0.008920717984437943), ('sydney', 0.00871263723820448), ('selection', 0.007927297614514828)]
Top 10 words of global topic #28
[('muscle', 0.0150497667491436), ('exercise', 0.014240351505577564), ('pain', 0.012276172637939453), ('stretch', 0.008763314224779606), ('knee', 0.008741729892790318), ('strength', 0.007608549669384956), ('stride', 0.006653440650552511), ('hip', 0.0056821433827281), ('low', 0.005455507431179285), ('movement', 0.005336793139576912)]
Top 10 words of global topic #29
[('athens', 0.014426724053919315), ('sydney', 0.010388976894319057), ('radcliffe', 0.010346021503210068), ('olympics', 0.00990420114248991), ('paris', 0.008965332992374897), ('bear', 0.008296466432511806), ('south_africa', 0.007707372307777405), ('british', 0.0076828268356621265), ('edmonton', 0.006940322928130627), ('gebrselassie', 0.006087364163249731)]


Top 10 words of local topic #0
[('win_qualified_list', 0.057440780103206635), ('florida_relay', 0.0339263379573822), ('win_head_head', 0.031613439321517944), ('region', 0.025445716455578804), ('invitational', 0.02236185222864151), ('qualifier_enter_win', 0.02120540477335453), ('recent_winners', 0.01850702613592148), ('rice', 0.01850702613592148), ('foot_locker', 0.01735057681798935), ('texas_relays', 0.01735057681798935)]
Top 10 words of local topic #1
[('krar', 0.05260445922613144), ('ker', 0.049920834600925446), ('komen', 0.03757615014910698), ('massage', 0.03059871681034565), ('epo', 0.02684163860976696), ('rly', 0.026304911822080612), ('pun', 0.025231461971998215), ('aden', 0.020937658846378326), ('antibiotic', 0.02040093205869198), ('meditation', 0.01986420713365078)]
Top 10 words of local topic #2
[('average', 0.18885190784931183), ('cain', 0.0813741534948349), ('elliott', 0.04656354337930679), ('time_average_total', 0.03960142284631729), ('davis', 0.03350956365466118), ('middle', 0.028723105788230896), ('siegel', 0.026547441259026527), ('efraimson', 0.02350151166319847), ('jelimo', 0.02219611406326294), ('newton', 0.019150186330080032)]
Top 10 words of local topic #3
[('food', 0.0209722351282835), ('eat', 0.020072465762495995), ('diet', 0.012735885567963123), ('protein', 0.010659495368599892), ('snack', 0.009759726002812386), ('meal', 0.008444679901003838), ('cup', 0.007890976034104824), ('serve', 0.0075449105352163315), ('fat', 0.0075449105352163315), ('healthy', 0.007475697435438633)]
Top 10 words of local topic #4
[('barefoot', 0.0995924323797226), ('recent_winners', 0.032831087708473206), ('sport_drink', 0.028773769736289978), ('product', 0.026191839948296547), ('supplement', 0.025454144924879074), ('lactate', 0.01955258846282959), ('ali', 0.018814893439412117), ('vitamin', 0.018446046859025955), ('carbohydrate', 0.01733950525522232), ('caffeine', 0.014388727955520153)]
Top 10 words of local topic #5
[('solinsky', 0.05966264754533768), ('gibb', 0.053327351808547974), ('track_club', 0.04804793745279312), ('hastings', 0.03696117177605629), ('james', 0.03696117177605629), ('club', 0.035377345979213715), ('smith', 0.031153814867138863), ('unattached', 0.030625872313976288), ('seed', 0.022178811952471733), ('champs', 0.021122930571436882)]
Top 10 words of local topic #6
[('dog', 0.056635621935129166), ('food', 0.03359382599592209), ('eat', 0.024710243567824364), ('thompson', 0.02221173606812954), ('wardian', 0.01665949448943138), ('sugar', 0.01610427163541317), ('protein', 0.015271435491740704), ('coffee', 0.01332815084606409), ('bar', 0.012495315633714199), ('probiotic', 0.012217703275382519)]
Top 10 words of local topic #7
[('klishina', 0.012808435596525669), ('ohio', 0.012106699869036674), ('john', 0.012106699869036674), ('michigan', 0.010352359153330326), ('brown', 0.010176924988627434), ('texas', 0.010001490823924541), ('florida', 0.010001490823924541), ('katie', 0.009650623425841331), ('princeton', 0.00947518926113844), ('indiana', 0.009299755096435547)]
Top 10 words of local topic #8
[('eat', 0.0513167530298233), ('diet', 0.0420110858976841), ('food', 0.03536418452858925), ('john', 0.025792645290493965), ('sugar', 0.016752853989601135), ('child', 0.01568935066461563), ('healthy', 0.014625845476984978), ('grain', 0.013296465389430523), ('juli', 0.013030588626861572), ('bunion', 0.011701208539307117)]
Top 10 words of local topic #9
[('nsw', 0.06237469241023064), ('australian', 0.03043569065630436), ('score', 0.02759295329451561), ('nsw_athlete', 0.021740255877375603), ('leap', 0.012877603992819786), ('perth', 0.012375944294035435), ('claim', 0.011539844796061516), ('decathlon', 0.011539844796061516), ('national_record', 0.010536524467170238), ('tremendous', 0.010202084667980671)]
Top 10 words of local topic #10
[('pre', 0.18959669768810272), ('wis', 0.06619299948215485), ('treadmill', 0.0370248518884182), ('milligram', 0.0160836149007082), ('plaque', 0.01496176328510046), ('iron', 0.013092010281980038), ('calcium', 0.012344108894467354), ('nutrient', 0.011596208438277245), ('heart', 0.011596208438277245), ('coconut_oil', 0.01084830705076456)]
Top 10 words of local topic #11
[('panose_font_bfont_family', 0.11549337208271027), ('mri', 0.054309144616127014), ('david', 0.03982767090201378), ('amy', 0.03439712151885033), ('beer_mile', 0.03331100940704346), ('crossfit', 0.03041471540927887), ('family', 0.028604531660676003), ('efraimson', 0.027518419548869133), ('bowman', 0.026432309299707413), ('steve', 0.02498416230082512)]
Top 10 words of local topic #12
[('unattached', 0.11896561831235886), ('nike', 0.02457546442747116), ('ccs', 0.020668795332312584), ('sjs', 0.01651008427143097), ('sds', 0.015879977494478226), ('foul', 0.011595244519412518), ('girl', 0.011091157793998718), ('adidas', 0.0109651368111372), ('florida_stat', 0.010587071999907494), ('brown', 0.010082985274493694)]
Top 10 words of local topic #13
[('colle', 0.01950256898999214), ('heart', 0.013734474778175354), ('brittany', 0.011537105776369572), ('smith', 0.011537105776369572), ('michael', 0.010804649442434311), ('connor', 0.01034686341881752), ('unattached', 0.009522850625216961), ('lauren', 0.009339735843241215), ('lowell', 0.009339735843241215), ('scott', 0.009248179383575916)]
Top 10 words of local topic #14
[('penn', 0.025970173999667168), ('bucknell', 0.01976880617439747), ('facility', 0.01744329184293747), ('pittsburgh', 0.016474328935146332), ('akron', 0.016086742281913757), ('syracuse', 0.01589294895529747), ('villanova', 0.014730192720890045), ('johnson', 0.013955021277070045), ('usc', 0.013373643159866333), ('jones', 0.012210885062813759)]
Top 10 words of local topic #15
[('meb', 0.06689087301492691), ('skin', 0.026908060535788536), ('nail', 0.02354501746594906), ('patient', 0.02055564895272255), ('toe', 0.01831362210214138), ('arg', 0.01756628043949604), ('chi', 0.01644526608288288), ('compression_sock', 0.014576910994946957), ('texas_relays', 0.014576910994946957), ('grey', 0.013455897569656372)]
Top 10 words of local topic #16
[('eat', 0.023904260247945786), ('food', 0.016701558604836464), ('cup', 0.015830902382731438), ('serve', 0.008944803848862648), ('meal', 0.008549051359295845), ('tablespoon', 0.008469901047647), ('calorie', 0.008232449181377888), ('egg', 0.008232449181377888), ('cook', 0.007757545914500952), ('rodgers', 0.007678395602852106)]
Top 10 words of local topic #17
[('oceans', 0.03273902088403702), ('win_qualified_list', 0.027065010741353035), ('ultra', 0.026192087680101395), ('comrades', 0.025319162756204605), ('stride', 0.024446239694952965), ('hunter', 0.019208693876862526), ('sambu', 0.017462845891714096), ('wright', 0.016589922830462456), ('robert', 0.016589922830462456), ('zatopek', 0.01615346036851406)]
Top 10 words of local topic #18
[('hill', 0.044214390218257904), ('heart_rate', 0.027302708476781845), ('power', 0.024645157158374786), ('gps', 0.020779630169272423), ('device', 0.017880484461784363), ('app', 0.016430910676717758), ('snowshoe', 0.015706123784184456), ('goodman', 0.013531764037907124), ('remember', 0.013048572465777397), ('measure', 0.011840594932436943)]
Top 10 words of local topic #19
[('dog', 0.09090621769428253), ('willie', 0.052966050803661346), ('hernandez', 0.04190016910433769), ('stitch', 0.02688218653202057), ('gill', 0.02530134655535221), ('fleshman', 0.022139666602015495), ('clayton', 0.022139666602015495), ('jake', 0.020558826625347137), ('torrence', 0.020558826625347137), ('james', 0.019768407568335533)]
Top 10 words of local topic #20
[('webb', 0.062236715108156204), ('store', 0.04835178330540657), ('nike', 0.02975589968264103), ('unattached', 0.025292886421084404), ('fisher', 0.021077819168567657), ('customer', 0.016366859897971153), ('program', 0.015127133578062057), ('finish_line', 0.014879188500344753), ('berian', 0.01264768186956644), ('grow', 0.01264768186956644)]
Top 10 words of local topic #21
[('answ', 0.027544302865862846), ('cassidy', 0.015947720035910606), ('jenny', 0.015706123784184456), ('davis', 0.01522293221205473), ('hayes', 0.014498145319521427), ('anderson', 0.013048572465777397), ('involve', 0.011598999612033367), ('official', 0.011115808971226215), ('play', 0.011115808971226215), ('track_field', 0.009907831437885761)]
Top 10 words of local topic #22
[('washington', 0.0313354954123497), ('heat_prelims', 0.0249281357973814), ('texas', 0.024138187989592552), ('oregon', 0.019749585539102554), ('unattached', 0.0185207761824131), ('byu', 0.018081916496157646), ('california', 0.01737974025309086), ('arkansas', 0.016765335574746132), ('baylor', 0.016150930896401405), ('iowa', 0.016063159331679344)]
Top 10 words of local topic #23
[('macey', 0.040283966809511185), ('stroller', 0.030214231461286545), ('myers', 0.02870377153158188), ('bike', 0.028200285509228706), ('bag', 0.027696799486875534), ('pack', 0.02417239360511303), ('george', 0.023668905720114708), ('cycling', 0.020144499838352203), ('backpack', 0.0166200939565897), ('jorgensen', 0.016116606071591377)]
Top 10 words of local topic #24
[('win_head_head', 0.21126173436641693), ('win_qualified_list', 0.09561096131801605), ('qualifier_enter_win', 0.03440001234412193), ('wins_qualifier_enter', 0.020586922764778137), ('indiana', 0.018149318173527718), ('michigan', 0.01652424782514572), ('qualifier', 0.01652424782514572), ('toledo', 0.015982557088136673), ('arizona', 0.015440867282450199), ('mexico', 0.014357487671077251)]
Top 10 words of local topic #25
[('eat', 0.042596958577632904), ('food', 0.024003785103559494), ('diet', 0.020623210817575455), ('protein', 0.020454181358218193), ('calorie', 0.02011612430214882), ('fuel', 0.01825680583715439), ('meal', 0.01217176765203476), ('pak', 0.01183370966464281), ('nutrition', 0.011664681136608124), ('drink', 0.011326623149216175)]
Top 10 words of local topic #26
[('md_md', 0.03574042767286301), ('kawauchi', 0.028524303808808327), ('food', 0.025775305926799774), ('protein', 0.025088055059313774), ('cook', 0.023713555186986923), ('torrence', 0.022682681679725647), ('harris', 0.021995430812239647), ('jones', 0.021308179944753647), ('martin', 0.019933680072426796), ('frazier', 0.01752830669283867)]
Top 10 words of local topic #27
[('dick', 0.028581636026501656), ('rick', 0.02709316462278366), ('exercise', 0.022627748548984528), ('rep', 0.022627748548984528), ('ineligibility', 0.01905541680753231), ('straight', 0.01905541680753231), ('hand', 0.017269250005483627), ('weight', 0.01637616567313671), ('repeat', 0.016078472137451172), ('lift', 0.014590000733733177)]
Top 10 words of local topic #28
[('sleep', 0.07437365502119064), ('heart', 0.044552031904459), ('symptom', 0.021725604310631752), ('heart_rate', 0.01767575368285179), ('jock', 0.01767575368285179), ('doctor', 0.016203081235289574), ('rutto', 0.013994072563946247), ('flow', 0.012889567762613297), ('carlos', 0.011785062961280346), ('parnov', 0.011785062961280346)]
Top 10 words of local topic #29
[('water', 0.029473967850208282), ('eat', 0.017436154186725616), ('salt', 0.016813507303595543), ('drink', 0.015775766223669052), ('sodium', 0.015360668301582336), ('food', 0.01287008449435234), ('fluid', 0.0124549875035882), ('flavor', 0.012247439473867416), ('xc_club', 0.010379502549767494), ('sugar', 0.010171953588724136)]

This model was able to determine some very specific global topics, such as shoe design (#2), multi-events (#3), field events (#6), food (#11), marathons (#14), family relationships (#15), cold weather running (#18), Jamaican sprinting (#22), and companies (#23). It also finds the following local topics:

  • Upper body training (#0)
  • Food preparation (#1)
  • Pregnancy & childbirth (#3)
  • Beer races (#4)
  • Muscle treatment (#5)
  • Hydration & recovery (#9)
  • Digital technology (#10)
  • Dairy (#11)
  • Dieting & weight management (#13)
  • Nutrients & food sources (#15)
  • Weight loss (#18)
  • Muscle training (#19)
  • Injuries & treatment (#21)
  • Shoe dimensions (#25)
  • Low-impact training (#25)
  • Joints (#29)

Supervised LDA

Supervised LDA (sLDA) is a topic model that learns topics using the values assigned to the documents, so that the topics it identifies are the ones that best predict those values for documents that have no value assigned. The assigned values can be continuous, ordinal, or categorical.
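
To make the prediction step concrete: in the original sLDA formulation with a continuous response, the expected value for a document is a linear function of its topic proportions. The sketch below shows only that final step in plain Python. The coefficients and proportions are made-up illustrative numbers, not output from a trained model, and predict_response is a hypothetical helper rather than part of any library.

```python
def predict_response(topic_proportions, coefficients):
    """Expected response for one document: the dot product of its topic
    proportions with the per-topic regression coefficients."""
    if len(topic_proportions) != len(coefficients):
        raise ValueError('one coefficient is needed per topic')
    return sum(p * c for p, c in zip(topic_proportions, coefficients))


# Illustrative numbers only: 3 topics, e.g. predicting an article rating.
coefficients = [4.0, 1.0, 2.0]
doc_proportions = [0.5, 0.25, 0.25]
print(predict_response(doc_proportions, coefficients))
```

A topic with a large coefficient pushes the predicted value up for any document that leans heavily on it, which is why the topics sLDA learns tend to be the ones that separate high-valued documents from low-valued ones.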

Labelled LDA

Labelled LDA (L-LDA) constrains LDA so documents are only modeled as mixtures of the topics corresponding to the document’s assigned labels. This correspondence between topics and labels makes topics easier to interpret, makes extracting and storing label-specific snippets easier, and enables the model to produce topic summaries as a multinomial distribution of each topic across the model’s vocabulary.

import tomotopy as tp
import json

file_name = 'running_articles.json'

with open(file_name, 'r') as input_file:
    data = json.load(input_file)

# collect every distinct label so the model gets one topic per label
unique_labels = set()
for row in data:
    unique_labels |= set(row['labels'])

unique_label_count = len(unique_labels)
mdl = tp.LLDAModel(k=unique_label_count, min_cf=100, rm_top=200)

for index, row in enumerate(data):
    if index % 1000 == 0 and index > 0:
        print('Adding document #%s' % index)
    content = row['content']
    labels = row['labels']
    clean_document = [token for token in content.strip().split() if len(token) > 2]
    mdl.add_doc(clean_document, labels=labels)

print('Starting training model')
for i in range(0, 100, 10):
    mdl.train(10)
    print('Iteration: {}\tLog-likelihood: {}'.format(i, mdl.ll_per_word))

for k in range(mdl.k):
    print('Top 10 words of global topic #{} {}: {}'.format(k, mdl.topic_label_dict[k], mdl.get_topic_words(k, top_n=10)))

The example code above uses a running_articles.json file, which has the following format:

[
    {
         "content": "article content",
         "labels": ["Label 1", "Label 2"]
    },
    ...
]
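
Before training, it can help to check how often each label occurs, since a label attached to only a handful of documents may give the model little to learn from. This is a minimal sketch, assuming the record format shown above. count_labels is a hypothetical helper and the sample records are invented for illustration.

```python
from collections import Counter


def count_labels(records):
    """Count how many documents carry each label."""
    counts = Counter()
    for row in records:
        # set() guards against a label being listed twice on one document
        counts.update(set(row['labels']))
    return counts


# Invented records in the same shape as running_articles.json
sample = [
    {'content': 'article one', 'labels': ['Marathon', 'Food']},
    {'content': 'article two', 'labels': ['Marathon']},
    {'content': 'article three', 'labels': ['Sprint', 'Sprint']},
]

for label, count in count_labels(sample).most_common():
    print('{}: {}'.format(label, count))
```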

When the training script above is run against a file in that format, the model should produce output very similar to this:

Top 10 words of global topic #0 NCAA: [('ncaa', 0.010269035585224628), ('oregon', 0.007440074346959591), ('rank', 0.0072905258275568485), ('freshman', 0.006489817518740892), ('squad', 0.005657953675836325), ('girl', 0.004835436120629311), ('spot', 0.004645384848117828), ('arkansas', 0.0043431720696389675), ('top', 0.0043307095766067505), ('sophomore', 0.0043182470835745335)]
Top 10 words of global topic #1 Long Distance Track & Field: [('rupp', 0.006629581097513437), ('lagat', 0.005583586171269417), ('american_record', 0.004688597749918699), ('huddle', 0.004128769505769014), ('flanagan', 0.003918834030628204), ('cain', 0.0038267571944743395), ('kick', 0.003469498362392187), ('simpson', 0.003417935222387314), ('standard', 0.0032153658103197813), ('oregon', 0.003171168966218829)]
Top 10 words of global topic #2 Federation: [('official', 0.005593809764832258), ('president', 0.004463536199182272), ('report', 0.004449137952178717), ('programme', 0.004179168026894331), ('rule', 0.0035204419400542974), ('receive', 0.0034196532797068357), ('future', 0.003398055676370859), ('decision', 0.0033800576347857714), ('usatf', 0.0032504722476005554), ('participant', 0.003228874644264579)]
Top 10 words of global topic #3 Multi-Events: [('heptathlon', 0.012408467940986156), ('decathlon', 0.010574668645858765), ('javelin', 0.007992642931640148), ('pole_vault', 0.007075743284076452), ('germany', 0.005596593022346497), ('overall', 0.005553343333303928), ('discus', 0.005324118304997683), ('german', 0.004900268279016018), ('national_record', 0.004895943216979504), ('register_weekly_newsletter', 0.004225568380206823)]
Top 10 words of global topic #4 European: [('stay_informed_late_news', 0.006548537407070398), ('register_weekly_newsletter', 0.006525514181703329), ('british', 0.004940190818160772), ('berlin', 0.004381051752716303), ('city', 0.004206731915473938), ('farah', 0.0040718805976212025), ('stadium', 0.0038219126872718334), ('host', 0.0037067958619445562), ('award', 0.003634436521679163), ('britain', 0.003446960588917136)]
Top 10 words of global topic #5 Sprint: [('bolt', 0.015239626169204712), ('jamaica', 0.006377596408128738), ('powell', 0.006149361841380596), ('jamaican', 0.005927647929638624), ('gay', 0.005582035519182682), ('sprinter', 0.0053277164697647095), ('johnson', 0.005171212833374739), ('lane', 0.004629970528185368), ('felix', 0.004571281373500824), ('usain_bolt', 0.004538676701486111)]
Top 10 words of global topic #6 Walking: [('china', 0.009112905710935593), ('chinese', 0.007623360026627779), ('walker', 0.006632217206060886), ('national_record', 0.005086035002022982), ('beijing', 0.004644268192350864), ('games', 0.004451703280210495), ('mexico', 0.00366445304825902), ('hurdles', 0.0034945427905768156), ('cuba', 0.0034945427905768156), ('cuban', 0.0033246325328946114)]
Top 10 words of global topic #7 Research: [('study', 0.012948261573910713), ('exercise', 0.010498165152966976), ('percent', 0.005814509466290474), ('researcher', 0.005585940554738045), ('increase', 0.0051969727501273155), ('test', 0.004936323966830969), ('research', 0.004370917100459337), ('effect', 0.004270667675882578), ('muscle', 0.0038215499371290207), ('measure', 0.0037333304062485695)]
Top 10 words of global topic #8 Corporate: [('company', 0.0070938728749752045), ('brand', 0.006842710077762604), ('fan', 0.004655757918953896), ('create', 0.004527113866060972), ('video', 0.004165684804320335), ('nike', 0.003969655372202396), ('offer', 0.003828759305179119), ('photo', 0.003810381516814232), ('story', 0.003693989012390375), ('available', 0.00363272987306118)]
Top 10 words of global topic #9 Race Strategy: [('kilometre', 0.008861791342496872), ('pack', 0.006145339459180832), ('course_record', 0.005742532201111317), ('leader', 0.0056526497937738895), ('ethiopian', 0.005532806273549795), ('ethiopia', 0.005436265375465155), ('pull', 0.004517465829849243), ('compatriot', 0.003908261656761169), ('catch', 0.0038183790165930986), ('cross_line', 0.003778431098908186)]
Top 10 words of global topic #10 Muscles: [('recovery', 0.00707999337464571), ('muscle', 0.007055969908833504), ('exercise', 0.006997628137469292), ('repeat', 0.0041079893708229065), ('strength', 0.00410112552344799), ('allow', 0.004063374828547239), ('stretch', 0.004049647133797407), ('increase', 0.0038471666630357504), ('relate', 0.0036206627264618874), ('faster', 0.0036035035736858845)]
Top 10 words of global topic #11 Marathon: [('boston', 0.010704131796956062), ('boston_marathon', 0.00864144042134285), ('city', 0.007159922271966934), ('finisher', 0.006983664818108082), ('participant', 0.005773679353296757), ('keflezighi', 0.005659349728375673), ('finish_line', 0.004487473983317614), ('community', 0.004125431180000305), ('york', 0.0037824432365596294), ('fun', 0.0037348060868680477)]
Top 10 words of global topic #12 Terrain: [('mountain', 0.00562551524490118), ('route', 0.005140978842973709), ('hour', 0.00497623672708869), ('ultra', 0.004171906039118767), ('colorado', 0.003294895403087139), ('city', 0.0032367510721087456), ('friend', 0.0031398439314216375), ('local', 0.0030235552694648504), ('park', 0.00282489531673491), ('travel', 0.002781287068501115)]
Top 10 words of global topic #13 Field Events: [('height', 0.006656527519226074), ('russian', 0.0060490807518363), ('pole_vault', 0.005596672184765339), ('jumper', 0.0052637201733887196), ('leap', 0.00510359788313508), ('outdoor', 0.004656272940337658), ('national_record', 0.004529192112386227), ('hurdles', 0.004402110818773508), ('russia', 0.004333487246185541), ('clearance', 0.003914120141416788)]
Top 10 words of global topic #14 Tokyo Olympics: [('japanese', 0.025172045454382896), ('japan', 0.015434647910296917), ('leader', 0.006641922984272242), ('debut', 0.00569724990054965), ('osaka', 0.005610049236565828), ('select', 0.005479248706251383), ('sub', 0.0052903140895068645), ('course_record', 0.005086845718324184), ('national_record', 0.0050432453863322735), ('marathon_debut', 0.004738043528050184)]
Top 10 words of global topic #15 Shoe: [('fit', 0.010419311933219433), ('provide_objective_exclusive_datum', 0.010116266086697578), ('shoe_real_world_usage', 0.010116266086697578), ('battery_mechanical_test_lab', 0.010116266086697578), ('addition_shoe_weight_measure', 0.009802200831472874), ('sole_thickness_sit_foot', 0.009802200831472874), ('flexibility_forefoot_account_review', 0.009802200831472874), ('road_foam_cushion_stride', 0.009802200831472874), ('light', 0.0070472415536642075), ('tester', 0.006937042810022831)]
Top 10 words of global topic #16 Food: [('eat', 0.013600863516330719), ('food', 0.011222190223634243), ('protein', 0.005624161567538977), ('diet', 0.005506857298314571), ('water', 0.005265731364488602), ('cup', 0.005083257798105478), ('meal', 0.004992020782083273), ('drink', 0.00471179373562336), ('fuel', 0.004457633942365646), ('calorie', 0.004320778883993626)]
Top 10 words of global topic #17 Weight Loss: [('treadmill', 0.043461889028549194), ('replay', 0.026658862829208374), ('club', 0.019126472994685173), ('viewer', 0.01854705810546875), ('outdoors', 0.013332328759133816), ('flow', 0.011594085022807121), ('outdoor', 0.011594085022807121), ('northeast', 0.01043525617569685), ('camera', 0.009855842217803001), ('northwest', 0.008697012439370155)]
Top 10 words of global topic #18 Injury: [('pain', 0.023611007258296013), ('barefoot', 0.014245044440031052), ('muscle', 0.011708430014550686), ('tendon', 0.010732809081673622), ('treatment', 0.010244998149573803), ('toe', 0.009074253030121326), ('symptom', 0.008488880470395088), ('cause', 0.008196193724870682), ('treat', 0.008001069538295269), ('exercise', 0.008001069538295269)]

As you can see, the words for each topic are relevant to the labels that were manually assigned to the documents. The best part about this model is that every topic is identified by a label provided by the developer, instead of a number determined by the model. Using those labels, we can associate unseen documents, words, or passages with known topics in our database and quantify the strength of their relationships. The model can also be re-run over time as documents and labels are added or removed, and used to update the database at relatively low cost. Lastly, this model trains very quickly (less than 30 seconds on a corpus of 82,000 documents, on a low-end dual-core computer made in 2010), making it suitable to be re-trained often and cheaply.
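
As a rough illustration of quantifying those relationships, the sketch below pairs a per-document topic distribution with the label names and keeps the strongest matches. The distribution here is invented. In tomotopy I believe one can be obtained with something like mdl.infer(mdl.make_doc(tokens)), but treat that call as an assumption to check against the library documentation. rank_labels is a hypothetical helper.

```python
def rank_labels(topic_dist, label_names, top_n=3):
    """Pair each topic weight with its label and return the strongest matches."""
    pairs = sorted(zip(label_names, topic_dist), key=lambda pair: pair[1], reverse=True)
    return pairs[:top_n]


# Invented distribution over four of the labels used above
labels = ['NCAA', 'Sprint', 'Food', 'Injury']
dist = [0.05, 0.10, 0.70, 0.15]

for label, weight in rank_labels(dist, labels, top_n=2):
    print('{}: {:.2f}'.format(label, weight))
```

Storing the (label, weight) pairs rather than only the top label keeps the strength of each relationship available for later filtering or thresholding.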

NOTE: tomotopy provides a label submodule (tomotopy.label) with the tools needed to generate meaningful labels for topics from models that do not produce labelled topics. It can prove very useful for automatically generating labels that will later be used by humans or in LLDA models.

Partially-Labelled LDA

Partially Labelled Dirichlet Allocation (PLDA) topic models make use of the labels assigned to each document, but also try to discover unlabelled topics within the corpus. In this sense, the model has both a supervised component (using the labels) and an unsupervised component. It is best used for improving upon an existing set of labels in a corpus, because it accounts for the existing labels while also attempting to discover new global topics, as well as subtopics within the labels you provided.

In my opinion, PLDA is particularly valuable. The model can be run repeatedly with the knowledge that any new topics it finds will not correspond to the labels provided. With each iteration, bad or uninformative topics can be discarded, and useful new topics can be added as labels on your documents to further refine the known topics/labels for the next execution of the PLDA model. This creates a feedback loop that the developer can continue until they have all the information they need, or until they stop finding useful topics.
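
One step of that feedback loop can be sketched in plain Python: after reviewing the latent topics by hand, give the useful ones a name and propose that name as a new label for every document that leans on the topic strongly enough. Everything here is an assumption made for illustration: the propose_labels helper, the 0.2 threshold, the label names, and the per-document topic weights.

```python
def propose_labels(doc_topic_weights, approved_topics, threshold=0.2):
    """Return, per document, the new labels suggested by reviewed latent topics.

    doc_topic_weights: list of dicts mapping latent topic id -> weight
    approved_topics: dict mapping latent topic id -> human-chosen label name
    """
    proposals = []
    for weights in doc_topic_weights:
        new_labels = sorted(
            approved_topics[topic_id]
            for topic_id, weight in weights.items()
            if topic_id in approved_topics and weight >= threshold
        )
        proposals.append(new_labels)
    return proposals


# Latent topics judged useful get a name; uninformative ones are simply left out.
approved = {20: 'Apparel & Gear', 24: 'Host Cities'}
docs = [{20: 0.45, 34: 0.30}, {24: 0.25, 20: 0.05}, {36: 0.60}]
print(propose_labels(docs, approved))
```

The proposed labels would then be merged into each document's existing labels before the next PLDA run, so the newly named topics become labelled topics and the latent slots are free to find something else.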

import tomotopy as tp
import json

file_name = 'running_articles.tagged.json'

with open(file_name, 'r') as input_file:
    data = json.load(input_file)

# labels that proved to be very vague based on the words with high probabilities
exclude_labels = ('Family/Friends', 'Athen Olympics', 'Crossfit', 'Recovery', 'Injury', 'Joint', 'Injury Treatment')

mdl = tp.PLDAModel(latent_topics=20, min_cf=100, rm_top=200)
count = 0
for row in data:
    if count % 1000 == 0 and count > 0:
        print('Adding document #%s' % count)
    count += 1
    content = row['raw']
    # keep only the label names (the keys of 'topics') that are not excluded
    label_values = [label for label in row['topics'] if label not in exclude_labels]
    clean_document = [token for token in content.strip().split() if len(token) > 2]
    mdl.add_doc(clean_document, labels=label_values)

print('Starting training model')
for i in range(0, 100, 10):
    mdl.train(10)
    print('Iteration: {}\tLog-likelihood: {}'.format(i, mdl.ll_per_word))

# print topic distributions of labelled topics
label_count = len(mdl.topic_label_dict)
for k in range(label_count):
    print('Top 10 words of labelled topic #{} {}: {}'.format(k, mdl.topic_label_dict[k], mdl.get_topic_words(k, top_n=10)))

# print topic distributions of un-labelled topics
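for k in range(label_count, label_count + mdl.latent_topics):
    print('Top 10 words of new topic #{}: {}'.format(k, mdl.get_topic_words(k, top_n=10)))

When executed, the script should produce output similar to this, with the labelled topics first and the newly discovered topics after them:
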
Top 10 words of labelled topic #0 NCAA: [('ncaa', 0.01613556779921055), ('rank', 0.011020577512681484), ('oregon', 0.011015341617166996), ('freshman', 0.010586039163172245), ('squad', 0.008654175326228142), ('arkansas', 0.007245851214975119), ('sophomore', 0.007078318390995264), ('spot', 0.007036434952169657), ('conference', 0.0069107855670154095), ('top', 0.006554777733981609)]
Top 10 words of labelled topic #1 Long Distance Track & Field: [('rupp', 0.017559224739670753), ('lagat', 0.014538757503032684), ('cain', 0.01085889432579279), ('american_record', 0.009231671690940857), ('simpson', 0.008529732003808022), ('ncaa', 0.008476555347442627), ('oregon', 0.006998228374868631), ('centrowitz', 0.00656217522919178), ('coburn', 0.006168663967400789), ('hasay', 0.006147393025457859)]
Top 10 words of labelled topic #2 Federation: [('report', 0.007789928000420332), ('decision', 0.007433941587805748), ('test', 0.00730829918757081), ('usatf', 0.006690558046102524), ('rule', 0.006596326362341642), ('federation', 0.005873882677406073), ('issue', 0.005842472426593304), ('president', 0.005601658020168543), ('process', 0.005099088419228792), ('dope', 0.004743102006614208)]
Top 10 words of labelled topic #3 Multi-Events: [('heptathlon', 0.022964442148804665), ('decathlon', 0.019880203530192375), ('javelin', 0.013323995284736156), ('pole_vault', 0.010433623567223549), ('overall', 0.009993018582463264), ('german', 0.008627141825854778), ('germany', 0.008309906348586082), ('discus', 0.008274657651782036), ('lifetime', 0.006979277823120356), ('tzis', 0.0066620418801903725)]
Top 10 words of labelled topic #4 European: [('register_weekly_newsletter', 0.022106200456619263), ('stay_informed_late_news', 0.021985894069075584), ('british', 0.0124918008223176), ('triumph', 0.010576942004263401), ('britain', 0.009454092010855675), ('italian', 0.009113227017223835), ('zurich', 0.007759792264550924), ('berlin', 0.007719690445810556), ('european_champion', 0.007559283636510372), ('farah', 0.007318672724068165)]
Top 10 words of labelled topic #5 Sprint: [('bolt', 0.04664570838212967), ('powell', 0.018393870443105698), ('gay', 0.01711251400411129), ('jamaican', 0.016389166936278343), ('jamaica', 0.01599649339914322), ('johnson', 0.014012457802891731), ('usain_bolt', 0.01320644374936819), ('sprinter', 0.012896438129246235), ('greene', 0.012007755227386951), ('merritt', 0.011222408153116703)]
Top 10 words of labelled topic #6 Walking: [('china', 0.019705742597579956), ('chinese', 0.017356378957629204), ('walker', 0.011350985616445541), ('games', 0.00880364328622818), ('national_record', 0.008262497372925282), ('mexico', 0.007866538129746914), ('cuban', 0.007167008239775896), ('cuba', 0.006863438989967108), ('brazil', 0.006757849827408791), ('beijing', 0.006335492245852947)]
Top 10 words of labelled topic #7 Research: [('study', 0.021403634920716286), ('exercise', 0.016799723729491234), ('researcher', 0.009221079759299755), ('percent', 0.008937457576394081), ('increase', 0.007888715714216232), ('research', 0.007070828694850206), ('test', 0.007037849631160498), ('effect', 0.0067938026040792465), ('muscle', 0.006101237144321203), ('measure', 0.005487821996212006)]
Top 10 words of labelled topic #8 Corporate: [('brand', 0.016838613897562027), ('company', 0.016671232879161835), ('nike', 0.00952409952878952), ('film', 0.00850308034569025), ('create', 0.008452866226434708), ('product', 0.007749869953840971), ('photo', 0.007197515107691288), ('sponsor', 0.007197515107691288), ('video', 0.007063610944896936), ('offer', 0.006745588965713978)]
Top 10 words of labelled topic #9 Race Strategy: [('kilometre', 0.016863549128174782), ('course_record', 0.011076201684772968), ('ethiopian', 0.009951071813702583), ('ethiopia', 0.009439036250114441), ('pack', 0.007579538971185684), ('leader', 0.007336996030062437), ('compatriot', 0.006811486091464758), ('pull', 0.005773940589278936), ('bekele', 0.005403388291597366), ('lead_pack', 0.005295591428875923)]
Top 10 words of labelled topic #10 Muscles: [('muscle', 0.012571621686220169), ('exercise', 0.01204314362257719), ('recovery', 0.011371665634214878), ('repeat', 0.007038145791739225), ('strength', 0.006895145867019892), ('increase', 0.006839189678430557), ('stretch', 0.006671320181339979), ('stride', 0.006242320407181978), ('allow', 0.0054900161921978), ('session', 0.005483798682689667)]
Top 10 words of labelled topic #11 Marathon: [('boston', 0.021748008206486702), ('boston_marathon', 0.02004600130021572), ('finisher', 0.01668250933289528), ('city', 0.014237563125789165), ('participant', 0.013440591283142567), ('finish_line', 0.01036076806485653), ('registration', 0.008104932494461536), ('keflezighi', 0.00783477257937193), ('york', 0.006835181266069412), ('nyrr', 0.006835181266069412)]
Top 10 words of labelled topic #12 Australian Athletics: [('nsw', 0.040423355996608734), ('australian', 0.039854537695646286), ('australia', 0.01684078946709633), ('claim', 0.01092882826924324), ('nsw_athlete', 0.010453260503709316), ('sydney', 0.009324952028691769), ('discus', 0.00822461862117052), ('leap', 0.007553229108452797), ('uts_norths', 0.007385381497442722), ('qualifier', 0.006452895700931549)]
Top 10 words of labelled topic #13 Terrain: [('mountain', 0.015646398067474365), ('route', 0.012933255173265934), ('ultra', 0.010388805530965328), ('colorado', 0.00947505235671997), ('hour', 0.008575357496738434), ('climb', 0.006944660563021898), ('park', 0.006747852545231581), ('town', 0.006241774186491966), ('view', 0.005820041988044977), ('adventure', 0.005763811059296131)]
Top 10 words of labelled topic #14 Field Events: [('height', 0.013304423540830612), ('russian', 0.010528053157031536), ('jumper', 0.010217099450528622), ('pole_vault', 0.009411951526999474), ('leap', 0.008873336017131805), ('outdoor', 0.008334719575941563), ('clearance', 0.007746129296720028), ('world_indoor', 0.006485657300800085), ('national_record', 0.006419024430215359), ('russia', 0.006313522346317768)]
Top 10 words of labelled topic #15 Tokyo Olympics: [('japanese', 0.04623497650027275), ('japan', 0.02760923095047474), ('leader', 0.009917521849274635), ('osaka', 0.009010959416627884), ('takahashi', 0.008598885498940945), ('invite', 0.008543942123651505), ('noguchi', 0.008434055373072624), ('surge', 0.007884623482823372), ('marathon_debut', 0.007857152260839939), ('tokyo', 0.007527492940425873)]
Top 10 words of labelled topic #16 Shoe: [('battery_mechanical_test_lab', 0.01625676266849041), ('shoe_real_world_usage', 0.01625676266849041), ('provide_objective_exclusive_datum', 0.01625676266849041), ('road_foam_cushion_stride', 0.01575206220149994), ('flexibility_forefoot_account_review', 0.01575206220149994), ('sole_thickness_sit_foot', 0.01575206220149994), ('addition_shoe_weight_measure', 0.01575206220149994), ('fit', 0.01283011119812727), ('tester', 0.011076940223574638), ('upper', 0.009669091552495956)]
Top 10 words of labelled topic #17 Food: [('eat', 0.020513825118541718), ('food', 0.017304055392742157), ('protein', 0.008826970122754574), ('diet', 0.008261146023869514), ('meal', 0.007664457429200411), ('cup', 0.007582155987620354), ('calorie', 0.006810576654970646), ('serve', 0.0065636709332466125), ('drink', 0.00655338354408741), ('fuel', 0.006409355439245701)]
Top 10 words of labelled topic #18 Weight Loss: [('treadmill', 0.08076833188533783), ('replay', 0.04703392833471298), ('club', 0.03374461829662323), ('viewer', 0.032722365111112595), ('outdoors', 0.02352207712829113), ('flow', 0.02147756703197956), ('outdoor', 0.020455311983823776), ('northeast', 0.018410803750157356), ('camera', 0.01738854870200157), ('northwest', 0.015344040468335152)]
Top 10 words of labelled topic #19 Anatomy: [('pain', 0.0332828089594841), ('barefoot', 0.020386258140206337), ('muscle', 0.015532718040049076), ('tendon', 0.015394045040011406), ('treatment', 0.013591301627457142), ('toe', 0.013452628627419472), ('cause', 0.011233867146074772), ('treat', 0.011095195077359676), ('symptom', 0.011095195077359676), ('bone', 0.01081785000860691)]



Top 10 words of new topic #20: [('wear', 0.012281172908842564), ('design', 0.008730517700314522), ('pair', 0.00803709588944912), ('fit', 0.006625188048928976), ('style', 0.006407971493899822), ('gear', 0.006240881979465485), ('offer', 0.006207463797181845), ('store', 0.0061072101816535), ('product', 0.005889993626624346), ('pack', 0.005255052819848061)]
Top 10 words of new topic #21: [('cross_line', 0.007639868184924126), ('leader', 0.007586340885609388), ('kick', 0.007523081265389919), ('pack', 0.006909949239343405), ('pull', 0.006768831517547369), ('battle', 0.006501194555312395), ('straight', 0.006301683373749256), ('pair', 0.005922125652432442), ('catch', 0.005620425567030907), ('final_lap', 0.005605827085673809)]
Top 10 words of new topic #22: [('boston', 0.0198564101010561), ('farah', 0.016155317425727844), ('flanagan', 0.015419493429362774), ('hall', 0.012553068809211254), ('keflezighi', 0.011026506312191486), ('huddle', 0.00997218955308199), ('boston_marathon', 0.009950224310159683), ('goucher', 0.008643311448395252), ('york', 0.008105169981718063), ('hill', 0.00689709885045886)]
Top 10 words of new topic #23: [('programme', 0.01538448128849268), ('athletics', 0.012805343605577946), ('des', 0.009095005691051483), ('development', 0.008778269402682781), ('develop', 0.008748103864490986), ('participant', 0.0075867376290261745), ('future', 0.007134257350116968), ('international', 0.0063801235519349575), ('les', 0.0063348752446472645), ('project', 0.006214214023202658)]
Top 10 words of new topic #24: [('city', 0.032513171434402466), ('host', 0.020260779187083244), ('stadium', 0.014702994376420975), ('club', 0.012012521736323833), ('organiser', 0.011557793244719505), ('venue', 0.01088833250105381), ('spectator', 0.009549411945044994), ('local', 0.009410466998815536), ('edition', 0.008564168587327003), ('offer', 0.007919969968497753)]
Top 10 words of new topic #25: [('olympics', 0.014522749930620193), ('beijing', 0.010871338658034801), ('preparation', 0.010567054152488708), ('aim', 0.008455506525933743), ('prepare', 0.008418623358011246), ('look_forward', 0.007800833787769079), ('shape', 0.007745509501546621), ('athens', 0.007478108163923025), ('confirm', 0.0073859007097780704), ('olympic_games', 0.007266031112521887)]
Top 10 words of new topic #26: [('warm', 0.009307012893259525), ('sleep', 0.009129738435149193), ('water', 0.009120874106884003), ('hour', 0.008890417404472828), ('condition', 0.007959725335240364), ('treadmill', 0.007915406487882137), ('winter', 0.007711540441960096), ('cold', 0.0074013094417750835), ('pain', 0.006231296341866255), ('doctor', 0.00556651595979929)]
Top 10 words of new topic #27: [('outdoor', 0.010660163126885891), ('world_indoor', 0.01026079524308443), ('jamaica', 0.00859164074063301), ('diamond_league', 0.008100110106170177), ('rio', 0.007393535692244768), ('dibaba', 0.007096569519490004), ('bolt', 0.006246631965041161), ('felix', 0.005785822402685881), ('beijing', 0.005488856229931116), ('lagat', 0.005058767274022102)]
Top 10 words of new topic #28: [('russia', 0.0393722802400589), ('russian', 0.03528737276792526), ('gbr', 0.021370992064476013), ('rus', 0.01867079921066761), ('ger', 0.017586106434464455), ('ken', 0.014285868965089321), ('walker', 0.014124318957328796), ('germany', 0.013824298046529293), ('france', 0.011954933404922485), ('spain', 0.011124104261398315)]
Top 10 words of new topic #29: [('program', 0.018501216545701027), ('award', 0.01519365981221199), ('community', 0.011055140756070614), ('club', 0.01094923447817564), ('official', 0.009474682621657848), ('president', 0.008855534717440605), ('track_field', 0.008260825648903847), ('volunteer', 0.006924768909811974), ('receive', 0.006533727515488863), ('opportunity', 0.006484847515821457)]
Top 10 words of new topic #30: [('women', 0.01614619791507721), ('pole_vault', 0.015184956602752209), ('girl', 0.015087439678609371), ('championships', 0.01281668245792389), ('world_junior', 0.011465373449027538), ('hurdles', 0.010699166916310787), ('boy', 0.0104066152125597), ('javelin', 0.009111030027270317), ('athens', 0.008233374916017056), ('youth', 0.007885099388659)]
Top 10 words of new topic #31: [('sec', 0.014779885299503803), ('wind', 0.011696704663336277), ('national_record', 0.011135445907711983), ('hurdles', 0.010132663883268833), ('semi_final', 0.009272067807614803), ('javelin', 0.007528423797339201), ('leap', 0.006795045919716358), ('pole_vault', 0.006555575877428055), ('stadium', 0.006031734403222799), ('equal', 0.005388157907873392)]
Top 10 words of new topic #32: [('family', 0.015851624310016632), ('bear', 0.009577351622283459), ('father', 0.009219201281666756), ('olympics', 0.007746803108602762), ('brother', 0.007123355288058519), ('friend', 0.0069774421863257885), ('moment', 0.006858058273792267), ('remember', 0.006592761725187302), ('mother', 0.00639378884807229), ('athens', 0.005942784249782562)]
Top 10 words of new topic #33: [('track_field', 0.02140161767601967), ('host', 0.014523514546453953), ('fan', 0.014087019488215446), ('schedule', 0.013385981321334839), ('women', 0.012235221453011036), ('announce', 0.010132108815014362), ('action', 0.008888759650290012), ('oregon', 0.008849077858030796), ('championships', 0.008346447721123695), ('usatf', 0.007976087741553783)]
Top 10 words of new topic #34: [('moment', 0.007893521338701248), ('bad', 0.0070918407291173935), ('maybe', 0.006958227138966322), ('wasn', 0.00678864074870944), ('bit', 0.006413495633751154), ('guy', 0.006022933404892683), ('actually', 0.005981821566820145), ('wait', 0.005611815024167299), ('probably', 0.00539083918556571), ('pretty', 0.004727910738438368)]
Top 10 words of new topic #35: [('standard', 0.030391905456781387), ('qualifier', 0.011385119520127773), ('trial', 0.010785192251205444), ('entry', 0.010130725800991058), ('average', 0.00902631413191557), ('overall', 0.00902631413191557), ('spot', 0.008153692819178104), ('select', 0.007880998775362968), ('rank', 0.007703747600317001), ('previous', 0.007035647053271532)]
Top 10 words of new topic #36: [('guy', 0.006537304725497961), ('probably', 0.006318273022770882), ('pretty', 0.005732783582061529), ('learn', 0.005673813167959452), ('fun', 0.005239961668848991), ('mind', 0.005159930791705847), ('actually', 0.005054626613855362), ('definitely', 0.0048355949111282825), ('reason', 0.004599714186042547), ('tough', 0.004582865629345179)]
Top 10 words of new topic #37: [('course_record', 0.011670761741697788), ('debut', 0.01113098207861185), ('recent', 0.010419455356895924), ('recently', 0.010108673013746738), ('ethiopia', 0.010002353228628635), ('sub', 0.009928747080266476), ('international', 0.006837284658104181), ('ethiopian', 0.006747321225702763), ('podium', 0.006600108928978443), ('defend_champion', 0.006272970233112574)]
Top 10 words of new topic #38: [('report', 0.020707234740257263), ('accord', 0.01601000502705574), ('official', 0.015511584468185902), ('write', 0.00934930145740509), ('rule', 0.009288886561989784), ('claim', 0.007944663055241108), ('police', 0.007929559797048569), ('post', 0.007008237764239311), ('receive', 0.006842097733169794), ('organizer', 0.006796787027269602)]
Top 10 words of new topic #39: [('family', 0.009664010256528854), ('friend', 0.008047547191381454), ('kid', 0.006173286121338606), ('grow', 0.005992130842059851), ('book', 0.005797040648758411), ('learn', 0.005776138044893742), ('write', 0.0056576901115477085), ('story', 0.005615885369479656), ('idea', 0.005378989968448877), ('child', 0.00501667894423008)]

Dynamic Topic Models

Dynamic Topic Models (DTM) capture how topics and word usage evolve over time, based on the time series of documents learned by the model. These models can outperform LDA at forecasting topics in future time periods. For more information on how DTMs work, I would highly recommend reading the original research paper, the Wikipedia page, and Modeling Musical Influence with Topic Models.
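
Because a DTM learns from a time series, each document has to be assigned to a time slice before training. This is a minimal sketch of that preprocessing step, assuming each article carries a publication year (the sample data shown in this post does not include one, so the years below are invented). Libraries typically want an integer time index per document, so the hypothetical assign_timepoints helper maps each distinct year to a consecutive index.

```python
def assign_timepoints(years):
    """Map each distinct year to a consecutive integer index, oldest first."""
    index_of = {year: i for i, year in enumerate(sorted(set(years)))}
    return [index_of[year] for year in years]


# Invented publication years for five articles
years = [2012, 2008, 2016, 2012, 2008]
print(assign_timepoints(years))
```

Coarser slices (for example, Olympic cycles instead of single years) are a design choice worth considering when some years contain very few documents.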

Pachinko models

Pachinko Allocation

Unlike LDA, which only models distributions of words over topics, Pachinko Allocation (PAM) also models the distribution of topics over other topics. PAM is intended as a method for capturing the correlations between topics and their subtopics.

This model is structured as a directed acyclic graph (DAG) where leaf nodes are words in the vocabulary of the corpus, and interior nodes are topics which have Dirichlet distributions over their child nodes. Some implementations of PAM include many layers of interior nodes/topics, which allow for more granular examination of the distributions of words and topics across other topics. Tomotopy has an implementation of Pachinko Allocation supporting 2 layers of topic nodes. The tomotopy implementation requires 2 parameters, k1 and k2, which correspond to the number of nodes at the first and second levels of the DAG.
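
The two-layer DAG described above can be pictured as a walk from the root down to a word: pick a super-topic, then one of its sub-topics, then a word from that sub-topic. The sketch below mimics that walk with small, hand-written distributions. It is an illustration of the structure only: in the real model the distributions at the interior nodes are drawn from Dirichlet priors for each document, and sample_word and all of the numbers are hypothetical.

```python
import random


def sample_word(super_weights, sub_given_super, words_given_sub, rng):
    """Walk root -> super-topic -> sub-topic -> word, sampling at each level."""
    super_id = rng.choices(range(len(super_weights)), weights=super_weights)[0]
    sub_weights = sub_given_super[super_id]
    sub_id = rng.choices(range(len(sub_weights)), weights=sub_weights)[0]
    vocab, word_weights = words_given_sub[sub_id]
    word = rng.choices(vocab, weights=word_weights)[0]
    return super_id, sub_id, word


# k1 = 2 super-topics, k2 = 2 sub-topics, 3 words per sub-topic
super_weights = [0.6, 0.4]
sub_given_super = [[0.9, 0.1], [0.2, 0.8]]
words_given_sub = [
    (['eat', 'food', 'diet'], [0.5, 0.3, 0.2]),
    (['recovery', 'muscle', 'exercise'], [0.5, 0.3, 0.2]),
]

rng = random.Random(0)
for _ in range(5):
    print(sample_word(super_weights, sub_given_super, words_given_sub, rng))
```

Because each super-topic has its own distribution over sub-topics, sub-topics that tend to appear together end up under the same super-topic, which is how the model expresses correlation between topics.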

import tomotopy as tp

file_name = 'running_articles.txt.clean'

mdl = tp.PAModel(k1=5, k2=100, min_cf=100)
# one pre-cleaned, whitespace-tokenized document per line
with open(file_name, 'r') as input_file:
    for line in input_file:
        clean_document = line.strip().split()
        mdl.add_doc(clean_document)

print('Starting training model')
iterations = 10
for i in range(0, 100, iterations):
    mdl.train(iterations)
    print('Iteration: {}\tLog-likelihood: {}'.format(i, mdl.ll_per_word))

for k in range(mdl.k1):
    subtopics = mdl.get_sub_topics(k)
    print('\n\nSubtopics of topic #%s' % k)
    for subtopic, probability in subtopics:
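        # note: the header text says 'Top 10', but top_n=20 words are printed per subtopic
        print('    Top 10 words of subtopic topic #%s: probability in supertopic #%s: %r' % (subtopic, k, probability))
        print('    %r' % mdl.get_topic_words(subtopic, top_n=20))

When executed, the script should produce output that begins like this:
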
Subtopics of topic #0
    Top 10 words of subtopic topic #3: probability in supertopic #0: 0.025703420862555504
    [('write', 0.008570623584091663), ('mind', 0.008051196113228798), ('matter', 0.007565941195935011), ('learn', 0.007490761112421751), ('reason', 0.007422415539622307), ('book', 0.007012341171503067), ('question', 0.006950829643756151), ('person', 0.006390394642949104), ('word', 0.006342552602291107), ('idea', 0.006342552602291107), ('friend', 0.00608967337757349), ('understand', 0.005713772028684616), ('read', 0.00547456182539463), ('simply', 0.004866284783929586), ('doesn', 0.004326353315263987), ('answer', 0.004162323661148548), ('story', 0.0040529705584049225), ('true', 0.0040529705584049225), ('sense', 0.0038957754150032997), ('share', 0.0037795875687152147)]
    Top 10 words of subtopic topic #31: probability in supertopic #0: 0.022865798324346542
    [('recovery', 0.01834804005920887), ('muscle', 0.010688613168895245), ('exercise', 0.009301993064582348), ('intensity', 0.008955338038504124), ('benefit', 0.008658205159008503), ('increase', 0.008427102118730545), ('fitness', 0.008377579972147942), ('interval', 0.00833631120622158), ('session', 0.008262027986347675), ('treadmill', 0.007841089740395546), ('strength', 0.006883661262691021), ('cross_training', 0.006248127203434706), ('perform', 0.006050038617104292), ('faster', 0.005818935111165047), ('fatigue', 0.005794174037873745), ('maintain', 0.005389743484556675), ('power', 0.0053567285649478436), ('relate', 0.00531546026468277), ('recover', 0.005224669352173805), ('hill', 0.004919282626360655)]
    Top 10 words of subtopic topic #83: probability in supertopic #0: 0.022770950570702553
    [('eat', 0.026321589946746826), ('food', 0.02069835737347603), ('diet', 0.013377685099840164), ('calorie', 0.008599433116614819), ('protein', 0.008113382384181023), ('meal', 0.00809094961732626), ('fuel', 0.007612376473844051), ('drink', 0.007006682455539703), ('fat', 0.006393510848283768), ('water', 0.006229001563042402), ('cup', 0.0061617023311555386), ('sugar', 0.0061617023311555386), ('healthy', 0.0060121482238173485), ('serve', 0.005870071705430746), ('nutrition', 0.00578781682997942), ('consume', 0.005615829955786467), ('carb', 0.0054662758484482765), ('flavor', 0.00510734599083662), ('snack', 0.004845626186579466), ('carbohydrate', 0.004838148597627878)]
    Top 10 words of subtopic topic #5: probability in supertopic #0: 0.02056308090686798
    [('pack', 0.024072770029306412), ('leader', 0.01922924630343914), ('surge', 0.012844948098063469), ('pull', 0.012539844028651714), ('catch', 0.012227112427353859), ('split', 0.011662670411169529), ('kilometre', 0.010518531315028667), ('kick', 0.010144779458642006), ('stay', 0.00983204785734415), ('gap', 0.00982442032545805), ('cross_line', 0.009382019750773907), ('lead_pack', 0.009115053340792656), ('cover', 0.008070073090493679), ('halfway', 0.0074827480129897594), ('quickly', 0.0073607065714895725), ('wind', 0.007345451042056084), ('hit', 0.006437767297029495), ('chase', 0.006407257169485092), ('finish_line', 0.006254705134779215), ('pair', 0.006216567009687424)]
    Top 10 words of subtopic topic #33: probability in supertopic #0: 0.020296484231948853
    [('study', 0.03414563834667206), ('exercise', 0.019984271377325058), ('percent', 0.01409480907022953), ('researcher', 0.013445052318274975), ('research', 0.01106260996311903), ('increase', 0.010712740942835808), ('effect', 0.010079644620418549), ('test', 0.009746436029672623), ('average', 0.008521893993020058), ('measure', 0.007805495988577604), ('compare', 0.007638891693204641), ('difference', 0.007189059630036354), ('benefit', 0.006372698582708836), ('suggest', 0.006247745361179113), ('low', 0.006089471280574799), ('datum', 0.006056150421500206), ('brain', 0.00603948999196291), ('report', 0.005931197199970484), ('trial', 0.0053480821661651134), ('evidence', 0.004939901642501354)]
    Top 10 words of subtopic topic #36: probability in supertopic #0: 0.02010168321430683
    [('shape', 0.01339754182845354), ('difficult', 0.008478627540171146), ('admit', 0.008223761804401875), ('pressure', 0.008121815510094166), ('mind', 0.0077989851124584675), ('confidence', 0.006966422777622938), ('olympics', 0.006652087904512882), ('prepare', 0.006533150561153889), ('confident', 0.006269788835197687), ('excited', 0.006159347016364336), ('feeling', 0.006133860442787409), ('explain', 0.006133860442787409), ('finally', 0.006074391771107912), ('tomorrow', 0.006057400722056627), ('nice', 0.005692092701792717), ('injure', 0.005173865240067244), ('healthy', 0.005071918945759535), ('preparation', 0.004986963234841824), ('special', 0.004842539317905903), ('decision', 0.004808557219803333)]
    Top 10 words of subtopic topic #8: probability in supertopic #0: 0.01974879391491413
    [('friend', 0.006512114312499762), ('remember', 0.0063078152015805244), ('smile', 0.005967317149043083), ('sit', 0.005882192403078079), ('eye', 0.005243758205324411), ('wear', 0.004733010660856962), ('sleep', 0.0046478863805532455), ('hand', 0.00463937409222126), ('real', 0.00452019926160574), ('word', 0.00452019926160574), ('finally', 0.00431590061634779), ('wake', 0.0042988755740225315), ('kid', 0.004264825955033302), ('laugh', 0.00405201455578208), ('call', 0.0039924271404743195), ('hit', 0.0038392029237002134), ('hour', 0.003745565889403224), ('sound', 0.0037200285587459803), ('pull', 0.003711516037583351), ('hair', 0.003711516037583351)]
    Top 10 words of subtopic topic #72: probability in supertopic #0: 0.019616352394223213
    [('semi_final', 0.029286066070199013), ('qualifier', 0.01562768779695034), ('lane', 0.014279311522841454), ('semifinal', 0.011572856456041336), ('advance', 0.010622202418744564), ('jamaica', 0.008352273143827915), ('defend_champion', 0.00829407013952732), ('semi', 0.0073434156365692616), ('tonight', 0.007100902032107115), ('semis', 0.006528569385409355), ('evening', 0.006286055315285921), ('ease', 0.00615994818508625), ('quick', 0.005946536082774401), ('qualifying', 0.005810728296637535), ('spot', 0.005791327450424433), ('cross_line', 0.005704022478312254), ('world_leader', 0.005539112724363804), ('national_record', 0.0054421075619757175), ('straight', 0.0053159999661147594), ('qualification', 0.005141390021890402)]
    Top 10 words of subtopic topic #74: probability in supertopic #0: 0.019237834960222244
    [('community', 0.021511727944016457), ('participant', 0.016647426411509514), ('program', 0.014740399084985256), ('local', 0.011562020517885685), ('participate', 0.01138697937130928), ('create', 0.009774758480489254), ('child', 0.009258847683668137), ('organization', 0.008190175518393517), ('charity', 0.00762820104137063), ('partner', 0.007333395071327686), ('raise', 0.007222842890769243), ('volunteer', 0.007020163349807262), ('host', 0.00646740198135376), ('opportunity', 0.006218659225851297), ('project', 0.0060620433650910854), ('encourage', 0.005979129113256931), ('visit', 0.005840939003974199), ('benefit', 0.005758024752140045), ('grow', 0.005730386823415756), ('kid', 0.005693535786122084)]
    Top 10 words of subtopic topic #54: probability in supertopic #0: 0.01920878142118454
    [('mileage', 0.014087958261370659), ('faster', 0.008924106135964394), ('learn', 0.007750116754323244), ('practice', 0.007682059425860643), ('prepare', 0.007477887440472841), ('build', 0.006924921181052923), ('hit', 0.006865371018648148), ('specific', 0.0064740413799881935), ('approach', 0.006363448221236467), ('schedule', 0.006354941055178642), ('marathoner', 0.006006147246807814), ('repeat', 0.005887046456336975), ('racing', 0.005767946131527424), ('training_plan', 0.00537661649286747), ('recovery', 0.00535960216075182), ('fitness', 0.00490872235968709), ('key', 0.004883200861513615), ('fit', 0.004857678897678852), ('tempo_run', 0.004721564240753651), ('strategy', 0.004679028410464525)]


Subtopics of topic #1
    Top 10 words of subtopic topic #5: probability in supertopic #1: 0.02489171363413334
    [('pack', 0.024072770029306412), ('leader', 0.01922924630343914), ('surge', 0.012844948098063469), ('pull', 0.012539844028651714), ('catch', 0.012227112427353859), ('split', 0.011662670411169529), ('kilometre', 0.010518531315028667), ('kick', 0.010144779458642006), ('stay', 0.00983204785734415), ('gap', 0.00982442032545805), ('cross_line', 0.009382019750773907), ('lead_pack', 0.009115053340792656), ('cover', 0.008070073090493679), ('halfway', 0.0074827480129897594), ('quickly', 0.0073607065714895725), ('wind', 0.007345451042056084), ('hit', 0.006437767297029495), ('chase', 0.006407257169485092), ('finish_line', 0.006254705134779215), ('pair', 0.006216567009687424)]
    Top 10 words of subtopic topic #3: probability in supertopic #1: 0.023838426917791367
    [('write', 0.008570623584091663), ('mind', 0.008051196113228798), ('matter', 0.007565941195935011), ('learn', 0.007490761112421751), ('reason', 0.007422415539622307), ('book', 0.007012341171503067), ('question', 0.006950829643756151), ('person', 0.006390394642949104), ('word', 0.006342552602291107), ('idea', 0.006342552602291107), ('friend', 0.00608967337757349), ('understand', 0.005713772028684616), ('read', 0.00547456182539463), ('simply', 0.004866284783929586), ('doesn', 0.004326353315263987), ('answer', 0.004162323661148548), ('story', 0.0040529705584049225), ('true', 0.0040529705584049225), ('sense', 0.0038957754150032997), ('share', 0.0037795875687152147)]
    Top 10 words of subtopic topic #83: probability in supertopic #1: 0.02324826270341873
    [('eat', 0.026321589946746826), ('food', 0.02069835737347603), ('diet', 0.013377685099840164), ('calorie', 0.008599433116614819), ('protein', 0.008113382384181023), ('meal', 0.00809094961732626), ('fuel', 0.007612376473844051), ('drink', 0.007006682455539703), ('fat', 0.006393510848283768), ('water', 0.006229001563042402), ('cup', 0.0061617023311555386), ('sugar', 0.0061617023311555386), ('healthy', 0.0060121482238173485), ('serve', 0.005870071705430746), ('nutrition', 0.00578781682997942), ('consume', 0.005615829955786467), ('carb', 0.0054662758484482765), ('flavor', 0.00510734599083662), ('snack', 0.004845626186579466), ('carbohydrate', 0.004838148597627878)]
    Top 10 words of subtopic topic #36: probability in supertopic #1: 0.02198244258761406
    [('shape', 0.01339754182845354), ('difficult', 0.008478627540171146), ('admit', 0.008223761804401875), ('pressure', 0.008121815510094166), ('mind', 0.0077989851124584675), ('confidence', 0.006966422777622938), ('olympics', 0.006652087904512882), ('prepare', 0.006533150561153889), ('confident', 0.006269788835197687), ('excited', 0.006159347016364336), ('feeling', 0.006133860442787409), ('explain', 0.006133860442787409), ('finally', 0.006074391771107912), ('tomorrow', 0.006057400722056627), ('nice', 0.005692092701792717), ('injure', 0.005173865240067244), ('healthy', 0.005071918945759535), ('preparation', 0.004986963234841824), ('special', 0.004842539317905903), ('decision', 0.004808557219803333)]
    Top 10 words of subtopic topic #8: probability in supertopic #1: 0.02046428993344307
    [('friend', 0.006512114312499762), ('remember', 0.0063078152015805244), ('smile', 0.005967317149043083), ('sit', 0.005882192403078079), ('eye', 0.005243758205324411), ('wear', 0.004733010660856962), ('sleep', 0.0046478863805532455), ('hand', 0.00463937409222126), ('real', 0.00452019926160574), ('word', 0.00452019926160574), ('finally', 0.00431590061634779), ('wake', 0.0042988755740225315), ('kid', 0.004264825955033302), ('laugh', 0.00405201455578208), ('call', 0.0039924271404743195), ('hit', 0.0038392029237002134), ('hour', 0.003745565889403224), ('sound', 0.0037200285587459803), ('pull', 0.003711516037583351), ('hair', 0.003711516037583351)]
    Top 10 words of subtopic topic #33: probability in supertopic #1: 0.020397748798131943
    [('study', 0.03414563834667206), ('exercise', 0.019984271377325058), ('percent', 0.01409480907022953), ('researcher', 0.013445052318274975), ('research', 0.01106260996311903), ('increase', 0.010712740942835808), ('effect', 0.010079644620418549), ('test', 0.009746436029672623), ('average', 0.008521893993020058), ('measure', 0.007805495988577604), ('compare', 0.007638891693204641), ('difference', 0.007189059630036354), ('benefit', 0.006372698582708836), ('suggest', 0.006247745361179113), ('low', 0.006089471280574799), ('datum', 0.006056150421500206), ('brain', 0.00603948999196291), ('report', 0.005931197199970484), ('trial', 0.0053480821661651134), ('evidence', 0.004939901642501354)]
    Top 10 words of subtopic topic #95: probability in supertopic #1: 0.02024826966226101
    [('hurdles', 0.014207434840500355), ('beijing', 0.011181634850800037), ('world_leader', 0.009999138303101063), ('olympic_silver_medallist', 0.008338426239788532), ('bronze_medallist', 0.008329731412231922), ('compatriot', 0.00810366589576006), ('doha', 0.007677619811147451), ('world_indoor_champion', 0.007677619811147451), ('merritt', 0.00678205257281661), ('berlin', 0.00676466291770339), ('record_holder', 0.006686409469693899), ('wariner', 0.0066342405043542385), ('finalist', 0.005869095679372549), ('medallist', 0.005816926714032888), ('european_champion', 0.005790842231363058), ('olympic_bronze_medallist', 0.005773452576249838), ('top', 0.005616945680230856), ('osaka', 0.005599556025117636), ('appearance', 0.005590861197561026), ('national_record', 0.005512607749551535)]
    Top 10 words of subtopic topic #38: probability in supertopic #1: 0.019874129444360733
    [('australian', 0.051293451339006424), ('nsw', 0.04125262051820755), ('australia', 0.027266850695014), ('sydney', 0.013537365011870861), ('claim', 0.01113928109407425), ('nsw_athlete', 0.010489419102668762), ('melbourne', 0.009574119932949543), ('leap', 0.008155406452715397), ('discus', 0.007862511090934277), ('selection', 0.007212648168206215), ('qualifier', 0.007148577366024256), ('vic', 0.006105136591941118), ('qualifi', 0.005629180930554867), ('championships', 0.005629180930554867), ('representative', 0.005537651013582945), ('javelin', 0.005382050294429064), ('uts_norths', 0.005015930626541376), ('select', 0.004951859824359417), ('national_title', 0.004906094633042812), ('sprinter', 0.004860329907387495)]
    Top 10 words of subtopic topic #31: probability in supertopic #1: 0.019621828570961952
    [('recovery', 0.01834804005920887), ('muscle', 0.010688613168895245), ('exercise', 0.009301993064582348), ('intensity', 0.008955338038504124), ('benefit', 0.008658205159008503), ('increase', 0.008427102118730545), ('fitness', 0.008377579972147942), ('interval', 0.00833631120622158), ('session', 0.008262027986347675), ('treadmill', 0.007841089740395546), ('strength', 0.006883661262691021), ('cross_training', 0.006248127203434706), ('perform', 0.006050038617104292), ('faster', 0.005818935111165047), ('fatigue', 0.005794174037873745), ('maintain', 0.005389743484556675), ('power', 0.0053567285649478436), ('relate', 0.00531546026468277), ('recover', 0.005224669352173805), ('hill', 0.004919282626360655)]
    Top 10 words of subtopic topic #54: probability in supertopic #1: 0.0191993098706007
    [('mileage', 0.014087958261370659), ('faster', 0.008924106135964394), ('learn', 0.007750116754323244), ('practice', 0.007682059425860643), ('prepare', 0.007477887440472841), ('build', 0.006924921181052923), ('hit', 0.006865371018648148), ('specific', 0.0064740413799881935), ('approach', 0.006363448221236467), ('schedule', 0.006354941055178642), ('marathoner', 0.006006147246807814), ('repeat', 0.005887046456336975), ('racing', 0.005767946131527424), ('training_plan', 0.00537661649286747), ('recovery', 0.00535960216075182), ('fitness', 0.00490872235968709), ('key', 0.004883200861513615), ('fit', 0.004857678897678852), ('tempo_run', 0.004721564240753651), ('strategy', 0.004679028410464525)]


Subtopics of topic #2
    Top 10 words of subtopic topic #3: probability in supertopic #2: 0.025822635740041733
    [('write', 0.008570623584091663), ('mind', 0.008051196113228798), ('matter', 0.007565941195935011), ('learn', 0.007490761112421751), ('reason', 0.007422415539622307), ('book', 0.007012341171503067), ('question', 0.006950829643756151), ('person', 0.006390394642949104), ('word', 0.006342552602291107), ('idea', 0.006342552602291107), ('friend', 0.00608967337757349), ('understand', 0.005713772028684616), ('read', 0.00547456182539463), ('simply', 0.004866284783929586), ('doesn', 0.004326353315263987), ('answer', 0.004162323661148548), ('story', 0.0040529705584049225), ('true', 0.0040529705584049225), ('sense', 0.0038957754150032997), ('share', 0.0037795875687152147)]
    Top 10 words of subtopic topic #83: probability in supertopic #2: 0.02392544224858284
    [('eat', 0.026321589946746826), ('food', 0.02069835737347603), ('diet', 0.013377685099840164), ('calorie', 0.008599433116614819), ('protein', 0.008113382384181023), ('meal', 0.00809094961732626), ('fuel', 0.007612376473844051), ('drink', 0.007006682455539703), ('fat', 0.006393510848283768), ('water', 0.006229001563042402), ('cup', 0.0061617023311555386), ('sugar', 0.0061617023311555386), ('healthy', 0.0060121482238173485), ('serve', 0.005870071705430746), ('nutrition', 0.00578781682997942), ('consume', 0.005615829955786467), ('carb', 0.0054662758484482765), ('flavor', 0.00510734599083662), ('snack', 0.004845626186579466), ('carbohydrate', 0.004838148597627878)]
    Top 10 words of subtopic topic #5: probability in supertopic #2: 0.02261176146566868
    [('pack', 0.024072770029306412), ('leader', 0.01922924630343914), ('surge', 0.012844948098063469), ('pull', 0.012539844028651714), ('catch', 0.012227112427353859), ('split', 0.011662670411169529), ('kilometre', 0.010518531315028667), ('kick', 0.010144779458642006), ('stay', 0.00983204785734415), ('gap', 0.00982442032545805), ('cross_line', 0.009382019750773907), ('lead_pack', 0.009115053340792656), ('cover', 0.008070073090493679), ('halfway', 0.0074827480129897594), ('quickly', 0.0073607065714895725), ('wind', 0.007345451042056084), ('hit', 0.006437767297029495), ('chase', 0.006407257169485092), ('finish_line', 0.006254705134779215), ('pair', 0.006216567009687424)]
    Top 10 words of subtopic topic #33: probability in supertopic #2: 0.021786000579595566
    [('study', 0.03414563834667206), ('exercise', 0.019984271377325058), ('percent', 0.01409480907022953), ('researcher', 0.013445052318274975), ('research', 0.01106260996311903), ('increase', 0.010712740942835808), ('effect', 0.010079644620418549), ('test', 0.009746436029672623), ('average', 0.008521893993020058), ('measure', 0.007805495988577604), ('compare', 0.007638891693204641), ('difference', 0.007189059630036354), ('benefit', 0.006372698582708836), ('suggest', 0.006247745361179113), ('low', 0.006089471280574799), ('datum', 0.006056150421500206), ('brain', 0.00603948999196291), ('report', 0.005931197199970484), ('trial', 0.0053480821661651134), ('evidence', 0.004939901642501354)]
    Top 10 words of subtopic topic #8: probability in supertopic #2: 0.021610280498862267
    [('friend', 0.006512114312499762), ('remember', 0.0063078152015805244), ('smile', 0.005967317149043083), ('sit', 0.005882192403078079), ('eye', 0.005243758205324411), ('wear', 0.004733010660856962), ('sleep', 0.0046478863805532455), ('hand', 0.00463937409222126), ('real', 0.00452019926160574), ('word', 0.00452019926160574), ('finally', 0.00431590061634779), ('wake', 0.0042988755740225315), ('kid', 0.004264825955033302), ('laugh', 0.00405201455578208), ('call', 0.0039924271404743195), ('hit', 0.0038392029237002134), ('hour', 0.003745565889403224), ('sound', 0.0037200285587459803), ('pull', 0.003711516037583351), ('hair', 0.003711516037583351)]
    Top 10 words of subtopic topic #36: probability in supertopic #2: 0.02145247533917427
    [('shape', 0.01339754182845354), ('difficult', 0.008478627540171146), ('admit', 0.008223761804401875), ('pressure', 0.008121815510094166), ('mind', 0.0077989851124584675), ('confidence', 0.006966422777622938), ('olympics', 0.006652087904512882), ('prepare', 0.006533150561153889), ('confident', 0.006269788835197687), ('excited', 0.006159347016364336), ('feeling', 0.006133860442787409), ('explain', 0.006133860442787409), ('finally', 0.006074391771107912), ('tomorrow', 0.006057400722056627), ('nice', 0.005692092701792717), ('injure', 0.005173865240067244), ('healthy', 0.005071918945759535), ('preparation', 0.004986963234841824), ('special', 0.004842539317905903), ('decision', 0.004808557219803333)]
    Top 10 words of subtopic topic #31: probability in supertopic #2: 0.02052946202456951
    [('recovery', 0.01834804005920887), ('muscle', 0.010688613168895245), ('exercise', 0.009301993064582348), ('intensity', 0.008955338038504124), ('benefit', 0.008658205159008503), ('increase', 0.008427102118730545), ('fitness', 0.008377579972147942), ('interval', 0.00833631120622158), ('session', 0.008262027986347675), ('treadmill', 0.007841089740395546), ('strength', 0.006883661262691021), ('cross_training', 0.006248127203434706), ('perform', 0.006050038617104292), ('faster', 0.005818935111165047), ('fatigue', 0.005794174037873745), ('maintain', 0.005389743484556675), ('power', 0.0053567285649478436), ('relate', 0.00531546026468277), ('recover', 0.005224669352173805), ('hill', 0.004919282626360655)]
    Top 10 words of subtopic topic #95: probability in supertopic #2: 0.02011573500931263
    [('hurdles', 0.014207434840500355), ('beijing', 0.011181634850800037), ('world_leader', 0.009999138303101063), ('olympic_silver_medallist', 0.008338426239788532), ('bronze_medallist', 0.008329731412231922), ('compatriot', 0.00810366589576006), ('doha', 0.007677619811147451), ('world_indoor_champion', 0.007677619811147451), ('merritt', 0.00678205257281661), ('berlin', 0.00676466291770339), ('record_holder', 0.006686409469693899), ('wariner', 0.0066342405043542385), ('finalist', 0.005869095679372549), ('medallist', 0.005816926714032888), ('european_champion', 0.005790842231363058), ('olympic_bronze_medallist', 0.005773452576249838), ('top', 0.005616945680230856), ('osaka', 0.005599556025117636), ('appearance', 0.005590861197561026), ('national_record', 0.005512607749551535)]
    Top 10 words of subtopic topic #54: probability in supertopic #2: 0.019718216732144356
    [('mileage', 0.014087958261370659), ('faster', 0.008924106135964394), ('learn', 0.007750116754323244), ('practice', 0.007682059425860643), ('prepare', 0.007477887440472841), ('build', 0.006924921181052923), ('hit', 0.006865371018648148), ('specific', 0.0064740413799881935), ('approach', 0.006363448221236467), ('schedule', 0.006354941055178642), ('marathoner', 0.006006147246807814), ('repeat', 0.005887046456336975), ('racing', 0.005767946131527424), ('training_plan', 0.00537661649286747), ('recovery', 0.00535960216075182), ('fitness', 0.00490872235968709), ('key', 0.004883200861513615), ('fit', 0.004857678897678852), ('tempo_run', 0.004721564240753651), ('strategy', 0.004679028410464525)]
    Top 10 words of subtopic topic #38: probability in supertopic #2: 0.017683690413832664
    [('australian', 0.051293451339006424), ('nsw', 0.04125262051820755), ('australia', 0.027266850695014), ('sydney', 0.013537365011870861), ('claim', 0.01113928109407425), ('nsw_athlete', 0.010489419102668762), ('melbourne', 0.009574119932949543), ('leap', 0.008155406452715397), ('discus', 0.007862511090934277), ('selection', 0.007212648168206215), ('qualifier', 0.007148577366024256), ('vic', 0.006105136591941118), ('qualifi', 0.005629180930554867), ('championships', 0.005629180930554867), ('representative', 0.005537651013582945), ('javelin', 0.005382050294429064), ('uts_norths', 0.005015930626541376), ('select', 0.004951859824359417), ('national_title', 0.004906094633042812), ('sprinter', 0.004860329907387495)]


Subtopics of topic #3
    Top 10 words of subtopic topic #3: probability in supertopic #3: 0.025044167414307594
    [('write', 0.008570623584091663), ('mind', 0.008051196113228798), ('matter', 0.007565941195935011), ('learn', 0.007490761112421751), ('reason', 0.007422415539622307), ('book', 0.007012341171503067), ('question', 0.006950829643756151), ('person', 0.006390394642949104), ('word', 0.006342552602291107), ('idea', 0.006342552602291107), ('friend', 0.00608967337757349), ('understand', 0.005713772028684616), ('read', 0.00547456182539463), ('simply', 0.004866284783929586), ('doesn', 0.004326353315263987), ('answer', 0.004162323661148548), ('story', 0.0040529705584049225), ('true', 0.0040529705584049225), ('sense', 0.0038957754150032997), ('share', 0.0037795875687152147)]
    Top 10 words of subtopic topic #54: probability in supertopic #3: 0.023646704852581024
    [('mileage', 0.014087958261370659), ('faster', 0.008924106135964394), ('learn', 0.007750116754323244), ('practice', 0.007682059425860643), ('prepare', 0.007477887440472841), ('build', 0.006924921181052923), ('hit', 0.006865371018648148), ('specific', 0.0064740413799881935), ('approach', 0.006363448221236467), ('schedule', 0.006354941055178642), ('marathoner', 0.006006147246807814), ('repeat', 0.005887046456336975), ('racing', 0.005767946131527424), ('training_plan', 0.00537661649286747), ('recovery', 0.00535960216075182), ('fitness', 0.00490872235968709), ('key', 0.004883200861513615), ('fit', 0.004857678897678852), ('tempo_run', 0.004721564240753651), ('strategy', 0.004679028410464525)]
    Top 10 words of subtopic topic #5: probability in supertopic #3: 0.02332421950995922
    [('pack', 0.024072770029306412), ('leader', 0.01922924630343914), ('surge', 0.012844948098063469), ('pull', 0.012539844028651714), ('catch', 0.012227112427353859), ('split', 0.011662670411169529), ('kilometre', 0.010518531315028667), ('kick', 0.010144779458642006), ('stay', 0.00983204785734415), ('gap', 0.00982442032545805), ('cross_line', 0.009382019750773907), ('lead_pack', 0.009115053340792656), ('cover', 0.008070073090493679), ('halfway', 0.0074827480129897594), ('quickly', 0.0073607065714895725), ('wind', 0.007345451042056084), ('hit', 0.006437767297029495), ('chase', 0.006407257169485092), ('finish_line', 0.006254705134779215), ('pair', 0.006216567009687424)]
    Top 10 words of subtopic topic #83: probability in supertopic #3: 0.02249370515346527
    [('eat', 0.026321589946746826), ('food', 0.02069835737347603), ('diet', 0.013377685099840164), ('calorie', 0.008599433116614819), ('protein', 0.008113382384181023), ('meal', 0.00809094961732626), ('fuel', 0.007612376473844051), ('drink', 0.007006682455539703), ('fat', 0.006393510848283768), ('water', 0.006229001563042402), ('cup', 0.0061617023311555386), ('sugar', 0.0061617023311555386), ('healthy', 0.0060121482238173485), ('serve', 0.005870071705430746), ('nutrition', 0.00578781682997942), ('consume', 0.005615829955786467), ('carb', 0.0054662758484482765), ('flavor', 0.00510734599083662), ('snack', 0.004845626186579466), ('carbohydrate', 0.004838148597627878)]
    Top 10 words of subtopic topic #31: probability in supertopic #3: 0.020764226093888283
    [('recovery', 0.01834804005920887), ('muscle', 0.010688613168895245), ('exercise', 0.009301993064582348), ('intensity', 0.008955338038504124), ('benefit', 0.008658205159008503), ('increase', 0.008427102118730545), ('fitness', 0.008377579972147942), ('interval', 0.00833631120622158), ('session', 0.008262027986347675), ('treadmill', 0.007841089740395546), ('strength', 0.006883661262691021), ('cross_training', 0.006248127203434706), ('perform', 0.006050038617104292), ('faster', 0.005818935111165047), ('fatigue', 0.005794174037873745), ('maintain', 0.005389743484556675), ('power', 0.0053567285649478436), ('relate', 0.00531546026468277), ('recover', 0.005224669352173805), ('hill', 0.004919282626360655)]
    Top 10 words of subtopic topic #33: probability in supertopic #3: 0.020133981481194496
    [('study', 0.03414563834667206), ('exercise', 0.019984271377325058), ('percent', 0.01409480907022953), ('researcher', 0.013445052318274975), ('research', 0.01106260996311903), ('increase', 0.010712740942835808), ('effect', 0.010079644620418549), ('test', 0.009746436029672623), ('average', 0.008521893993020058), ('measure', 0.007805495988577604), ('compare', 0.007638891693204641), ('difference', 0.007189059630036354), ('benefit', 0.006372698582708836), ('suggest', 0.006247745361179113), ('low', 0.006089471280574799), ('datum', 0.006056150421500206), ('brain', 0.00603948999196291), ('report', 0.005931197199970484), ('trial', 0.0053480821661651134), ('evidence', 0.004939901642501354)]
    Top 10 words of subtopic topic #8: probability in supertopic #3: 0.019887786358594894
    [('friend', 0.006512114312499762), ('remember', 0.0063078152015805244), ('smile', 0.005967317149043083), ('sit', 0.005882192403078079), ('eye', 0.005243758205324411), ('wear', 0.004733010660856962), ('sleep', 0.0046478863805532455), ('hand', 0.00463937409222126), ('real', 0.00452019926160574), ('word', 0.00452019926160574), ('finally', 0.00431590061634779), ('wake', 0.0042988755740225315), ('kid', 0.004264825955033302), ('laugh', 0.00405201455578208), ('call', 0.0039924271404743195), ('hit', 0.0038392029237002134), ('hour', 0.003745565889403224), ('sound', 0.0037200285587459803), ('pull', 0.003711516037583351), ('hair', 0.003711516037583351)]
    Top 10 words of subtopic topic #38: probability in supertopic #3: 0.01928180642426014
    [('australian', 0.051293451339006424), ('nsw', 0.04125262051820755), ('australia', 0.027266850695014), ('sydney', 0.013537365011870861), ('claim', 0.01113928109407425), ('nsw_athlete', 0.010489419102668762), ('melbourne', 0.009574119932949543), ('leap', 0.008155406452715397), ('discus', 0.007862511090934277), ('selection', 0.007212648168206215), ('qualifier', 0.007148577366024256), ('vic', 0.006105136591941118), ('qualifi', 0.005629180930554867), ('championships', 0.005629180930554867), ('representative', 0.005537651013582945), ('javelin', 0.005382050294429064), ('uts_norths', 0.005015930626541376), ('select', 0.004951859824359417), ('national_title', 0.004906094633042812), ('sprinter', 0.004860329907387495)]
    Top 10 words of subtopic topic #95: probability in supertopic #3: 0.018945449963212013
    [('hurdles', 0.014207434840500355), ('beijing', 0.011181634850800037), ('world_leader', 0.009999138303101063), ('olympic_silver_medallist', 0.008338426239788532), ('bronze_medallist', 0.008329731412231922), ('compatriot', 0.00810366589576006), ('doha', 0.007677619811147451), ('world_indoor_champion', 0.007677619811147451), ('merritt', 0.00678205257281661), ('berlin', 0.00676466291770339), ('record_holder', 0.006686409469693899), ('wariner', 0.0066342405043542385), ('finalist', 0.005869095679372549), ('medallist', 0.005816926714032888), ('european_champion', 0.005790842231363058), ('olympic_bronze_medallist', 0.005773452576249838), ('top', 0.005616945680230856), ('osaka', 0.005599556025117636), ('appearance', 0.005590861197561026), ('national_record', 0.005512607749551535)]
    Top 10 words of subtopic topic #74: probability in supertopic #3: 0.018943721428513527
    [('community', 0.021511727944016457), ('participant', 0.016647426411509514), ('program', 0.014740399084985256), ('local', 0.011562020517885685), ('participate', 0.01138697937130928), ('create', 0.009774758480489254), ('child', 0.009258847683668137), ('organization', 0.008190175518393517), ('charity', 0.00762820104137063), ('partner', 0.007333395071327686), ('raise', 0.007222842890769243), ('volunteer', 0.007020163349807262), ('host', 0.00646740198135376), ('opportunity', 0.006218659225851297), ('project', 0.0060620433650910854), ('encourage', 0.005979129113256931), ('visit', 0.005840939003974199), ('benefit', 0.005758024752140045), ('grow', 0.005730386823415756), ('kid', 0.005693535786122084)]


Subtopics of topic #4
    Top 10 words of subtopic topic #3: probability in supertopic #4: 0.02532106824219227
    [('write', 0.008570623584091663), ('mind', 0.008051196113228798), ('matter', 0.007565941195935011), ('learn', 0.007490761112421751), ('reason', 0.007422415539622307), ('book', 0.007012341171503067), ('question', 0.006950829643756151), ('person', 0.006390394642949104), ('word', 0.006342552602291107), ('idea', 0.006342552602291107), ('friend', 0.00608967337757349), ('understand', 0.005713772028684616), ('read', 0.00547456182539463), ('simply', 0.004866284783929586), ('doesn', 0.004326353315263987), ('answer', 0.004162323661148548), ('story', 0.0040529705584049225), ('true', 0.0040529705584049225), ('sense', 0.0038957754150032997), ('share', 0.0037795875687152147)]
    Top 10 words of subtopic topic #83: probability in supertopic #4: 0.022474857047200203
    [('eat', 0.026321589946746826), ('food', 0.02069835737347603), ('diet', 0.013377685099840164), ('calorie', 0.008599433116614819), ('protein', 0.008113382384181023), ('meal', 0.00809094961732626), ('fuel', 0.007612376473844051), ('drink', 0.007006682455539703), ('fat', 0.006393510848283768), ('water', 0.006229001563042402), ('cup', 0.0061617023311555386), ('sugar', 0.0061617023311555386), ('healthy', 0.0060121482238173485), ('serve', 0.005870071705430746), ('nutrition', 0.00578781682997942), ('consume', 0.005615829955786467), ('carb', 0.0054662758484482765), ('flavor', 0.00510734599083662), ('snack', 0.004845626186579466), ('carbohydrate', 0.004838148597627878)]
    Top 10 words of subtopic topic #5: probability in supertopic #4: 0.021299513056874275
    [('pack', 0.024072770029306412), ('leader', 0.01922924630343914), ('surge', 0.012844948098063469), ('pull', 0.012539844028651714), ('catch', 0.012227112427353859), ('split', 0.011662670411169529), ('kilometre', 0.010518531315028667), ('kick', 0.010144779458642006), ('stay', 0.00983204785734415), ('gap', 0.00982442032545805), ('cross_line', 0.009382019750773907), ('lead_pack', 0.009115053340792656), ('cover', 0.008070073090493679), ('halfway', 0.0074827480129897594), ('quickly', 0.0073607065714895725), ('wind', 0.007345451042056084), ('hit', 0.006437767297029495), ('chase', 0.006407257169485092), ('finish_line', 0.006254705134779215), ('pair', 0.006216567009687424)]
    Top 10 words of subtopic topic #33: probability in supertopic #4: 0.02052655816078186
    [('study', 0.03414563834667206), ('exercise', 0.019984271377325058), ('percent', 0.01409480907022953), ('researcher', 0.013445052318274975), ('research', 0.01106260996311903), ('increase', 0.010712740942835808), ('effect', 0.010079644620418549), ('test', 0.009746436029672623), ('average', 0.008521893993020058), ('measure', 0.007805495988577604), ('compare', 0.007638891693204641), ('difference', 0.007189059630036354), ('benefit', 0.006372698582708836), ('suggest', 0.006247745361179113), ('low', 0.006089471280574799), ('datum', 0.006056150421500206), ('brain', 0.00603948999196291), ('report', 0.005931197199970484), ('trial', 0.0053480821661651134), ('evidence', 0.004939901642501354)]
    Top 10 words of subtopic topic #95: probability in supertopic #4: 0.02036111429333687
    [('hurdles', 0.014207434840500355), ('beijing', 0.011181634850800037), ('world_leader', 0.009999138303101063), ('olympic_silver_medallist', 0.008338426239788532), ('bronze_medallist', 0.008329731412231922), ('compatriot', 0.00810366589576006), ('doha', 0.007677619811147451), ('world_indoor_champion', 0.007677619811147451), ('merritt', 0.00678205257281661), ('berlin', 0.00676466291770339), ('record_holder', 0.006686409469693899), ('wariner', 0.0066342405043542385), ('finalist', 0.005869095679372549), ('medallist', 0.005816926714032888), ('european_champion', 0.005790842231363058), ('olympic_bronze_medallist', 0.005773452576249838), ('top', 0.005616945680230856), ('osaka', 0.005599556025117636), ('appearance', 0.005590861197561026), ('national_record', 0.005512607749551535)]
    Top 10 words of subtopic topic #31: probability in supertopic #4: 0.020311133936047554
    [('recovery', 0.01834804005920887), ('muscle', 0.010688613168895245), ('exercise', 0.009301993064582348), ('intensity', 0.008955338038504124), ('benefit', 0.008658205159008503), ('increase', 0.008427102118730545), ('fitness', 0.008377579972147942), ('interval', 0.00833631120622158), ('session', 0.008262027986347675), ('treadmill', 0.007841089740395546), ('strength', 0.006883661262691021), ('cross_training', 0.006248127203434706), ('perform', 0.006050038617104292), ('faster', 0.005818935111165047), ('fatigue', 0.005794174037873745), ('maintain', 0.005389743484556675), ('power', 0.0053567285649478436), ('relate', 0.00531546026468277), ('recover', 0.005224669352173805), ('hill', 0.004919282626360655)]
    Top 10 words of subtopic topic #36: probability in supertopic #4: 0.020216362550854683
    [('shape', 0.01339754182845354), ('difficult', 0.008478627540171146), ('admit', 0.008223761804401875), ('pressure', 0.008121815510094166), ('mind', 0.0077989851124584675), ('confidence', 0.006966422777622938), ('olympics', 0.006652087904512882), ('prepare', 0.006533150561153889), ('confident', 0.006269788835197687), ('excited', 0.006159347016364336), ('feeling', 0.006133860442787409), ('explain', 0.006133860442787409), ('finally', 0.006074391771107912), ('tomorrow', 0.006057400722056627), ('nice', 0.005692092701792717), ('injure', 0.005173865240067244), ('healthy', 0.005071918945759535), ('preparation', 0.004986963234841824), ('special', 0.004842539317905903), ('decision', 0.004808557219803333)]
    Top 10 words of subtopic topic #54: probability in supertopic #4: 0.01926417276263237
    [('mileage', 0.014087958261370659), ('faster', 0.008924106135964394), ('learn', 0.007750116754323244), ('practice', 0.007682059425860643), ('prepare', 0.007477887440472841), ('build', 0.006924921181052923), ('hit', 0.006865371018648148), ('specific', 0.0064740413799881935), ('approach', 0.006363448221236467), ('schedule', 0.006354941055178642), ('marathoner', 0.006006147246807814), ('repeat', 0.005887046456336975), ('racing', 0.005767946131527424), ('training_plan', 0.00537661649286747), ('recovery', 0.00535960216075182), ('fitness', 0.00490872235968709), ('key', 0.004883200861513615), ('fit', 0.004857678897678852), ('tempo_run', 0.004721564240753651), ('strategy', 0.004679028410464525)]
    Top 10 words of subtopic topic #8: probability in supertopic #4: 0.019220229238271713
    [('friend', 0.006512114312499762), ('remember', 0.0063078152015805244), ('smile', 0.005967317149043083), ('sit', 0.005882192403078079), ('eye', 0.005243758205324411), ('wear', 0.004733010660856962), ('sleep', 0.0046478863805532455), ('hand', 0.00463937409222126), ('real', 0.00452019926160574), ('word', 0.00452019926160574), ('finally', 0.00431590061634779), ('wake', 0.0042988755740225315), ('kid', 0.004264825955033302), ('laugh', 0.00405201455578208), ('call', 0.0039924271404743195), ('hit', 0.0038392029237002134), ('hour', 0.003745565889403224), ('sound', 0.0037200285587459803), ('pull', 0.003711516037583351), ('hair', 0.003711516037583351)]
    Top 10 words of subtopic topic #74: probability in supertopic #4: 0.018584294244647026
    [('community', 0.021511727944016457), ('participant', 0.016647426411509514), ('program', 0.014740399084985256), ('local', 0.011562020517885685), ('participate', 0.01138697937130928), ('create', 0.009774758480489254), ('child', 0.009258847683668137), ('organization', 0.008190175518393517), ('charity', 0.00762820104137063), ('partner', 0.007333395071327686), ('raise', 0.007222842890769243), ('volunteer', 0.007020163349807262), ('host', 0.00646740198135376), ('opportunity', 0.006218659225851297), ('project', 0.0060620433650910854), ('encourage', 0.005979129113256931), ('visit', 0.005840939003974199), ('benefit', 0.005758024752140045), ('grow', 0.005730386823415756), ('kid', 0.005693535786122084)]

Hierarchical Pachinko Allocation

Hierarchical Pachinko Allocation (HPAM) is a special type of PAM that creates a hierarchy of topics like HLDA, but because PAM uses a DAG instead of a tree, child topics can have multiple parent topics. HPAM does this by giving every node at every level a Dirichlet distribution over the vocabulary itself, rather than only the lowest level of topics having a distribution over the vocabulary. The Dirichlet distributions at the interior nodes are critical to HPAM: their parameters represent the hierarchy, and they also allow computation to be greatly reduced compared to HLDA.
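
To make the two ideas above concrete (sub-topics shared by several parents, and every node being able to emit words), here is a toy sketch in plain Python. It is not tomotopy's sampler and not the full HPAM model: the node names, the tiny vocabularies, the uniform choices, and the `stop_prob` parameter are all made up for illustration, whereas the real model draws these choices from learned distributions.

```python
import random

# Hypothetical toy hierarchy: one root, 2 super-topics, 3 sub-topics.
SUPER_TOPICS = ["super_0", "super_1"]
SUB_TOPICS = ["sub_0", "sub_1", "sub_2"]

# In a DAG, a sub-topic can hang off every super-topic (multiple parents).
PARENTS = {sub: list(SUPER_TOPICS) for sub in SUB_TOPICS}

# Every node, not only the leaves, has its own words (made-up vocabularies).
WORDS = {
    "root": ["run", "race"],
    "super_0": ["training", "mileage"],
    "super_1": ["olympics", "medal"],
    "sub_0": ["tempo_run", "interval"],
    "sub_1": ["hammer", "javelin"],
    "sub_2": ["diet", "protein"],
}


def sample_word(rng, stop_prob=0.3):
    """Walk root -> super-topic -> sub-topic, possibly stopping early.

    Returns (path, word); the word is drawn from the last node on the path.
    """
    path = ["root"]
    if rng.random() >= stop_prob:        # descend to a super-topic
        path.append(rng.choice(SUPER_TOPICS))
        if rng.random() >= stop_prob:    # descend to a sub-topic
            path.append(rng.choice(SUB_TOPICS))
    word = rng.choice(WORDS[path[-1]])
    return path, word


rng = random.Random(0)
samples = [sample_word(rng) for _ in range(5)]
for path, word in samples:
    print(" -> ".join(path), "emits", word)
```

Because a walk can stop at the root or at a super-topic, general words can live high in the hierarchy while specific words sit in the sub-topics, and the same sub-topic can be reached through either super-topic.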

To my knowledge, the only reason to use HLDA over HPAM would be if topics could only have one parent topic. Otherwise, go with HPAM. Tomotopy has an implementation of HPAM, which takes the same parameters as PAM, k1 and k2, which correspond to the number of nodes at the first and second levels of the DAG.

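    import tomotopy as tp

    file_name = 'running_articles.txt.clean'

    # k1 and k2 set the number of first- and second-level topics;
    # min_cf drops rare words and rm_top drops the most frequent ones.
    mdl = tp.HPAModel(k1=5, k2=100, min_cf=100, rm_top=200)

    # One pre-cleaned, whitespace-tokenized document per line.
    with open(file_name, 'r') as f:
        for line in f:
            document = line.strip().split()
            if document:  # skip blank lines
                mdl.add_doc(document)

    # Train in chunks of 10 iterations (100 in total), logging after each chunk.
    print('Starting training model')
    iterations = 10
    for i in range(0, 100, iterations):
        mdl.train(iterations)
        print('Iteration: {}\tLog-likelihood: {}'.format(i, mdl.ll_per_word))

    # Topic ids are flat: 0 is the root, 1..k1 are the level 1 topics,
    # and k1+1..k1+k2 are the level 2 topics.
    # Note: top_n=30 words are fetched even though the labels say 'Top 10'.
    root_index = 0
    words = mdl.get_topic_words(root_index, top_n=30)
    print('Root Topic #%s' % root_index)
    print('Top 10 words of Root Topic #%s: %r' % (root_index, words))

    for k in range(1, 1 + mdl.k1):
        words = mdl.get_topic_words(k, top_n=30)
        print('\n\nLevel 1 Topic #%s' % k)
        print('    Top 10 words of Level 1 Topic #%s: %r' % (k, words))

    for k in range(1 + mdl.k1, 1 + mdl.k1 + mdl.k2):
        words = mdl.get_topic_words(k, top_n=30)
        print('\n\nLevel 2 Topic #%s' % (k - mdl.k1, ))
        print('    Top 10 words of Level 2 Topic #%s: %r' % (k - mdl.k1, words))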
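
The loops above rely on a flat topic id layout: the root at 0, then the k1 first-level topics, then the k2 second-level topics. As a small sketch of that bookkeeping, the helper below maps a flat id back to its level and the label the loops would print for it. `describe_topic` is a hypothetical name of my own, not part of tomotopy, and it only assumes the layout implied by the code above.

```python
def describe_topic(topic_id, k1, k2):
    """Map a flat topic id to (level, label) using the layout of the loops above.

    Assumed layout: 0 is the root, 1..k1 are level 1 topics,
    and k1+1..k1+k2 are level 2 topics (labelled 1..k2).
    """
    if topic_id == 0:
        return ("root", 0)
    if 1 <= topic_id <= k1:
        return ("level 1", topic_id)
    if k1 < topic_id <= k1 + k2:
        return ("level 2", topic_id - k1)
    raise ValueError("topic_id %r is outside 0..%d" % (topic_id, k1 + k2))


# With k1=5 and k2=100, as in the model above:
for topic_id in (0, 5, 6, 105):
    print(topic_id, describe_topic(topic_id, k1=5, k2=100))
```

Keeping this mapping in one place makes it harder to mix up the flat ids that `get_topic_words` is called with and the per-level labels printed next to them.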

The hierarchy of topics is shown below. I’ve only included the leaf topics that appeared in over 100 documents, so that only the more dominant topics are shown:

Root Topic #0
Top 10 words of Root Topic #0: [('hour', 0.003891778877004981), ('idea', 0.002932515926659107), ('reason', 0.002894098637625575), ('grow', 0.0027252964209765196), ('write', 0.002640313236042857), ('hit', 0.002636820776388049), ('share', 0.00257512042298913), ('stay', 0.002544852439314127), ('sign', 0.0024959580041468143), ('drive', 0.0024738390929996967), ('hand', 0.002451720181852579), ('story', 0.0024156314320862293), ('finish_line', 0.002407482359558344), ('doesn', 0.002317842561751604), ('travel', 0.0023120217956602573), ('matter', 0.002295723417773843), ('car', 0.0021711590234190226), ('child', 0.0021618457976728678), ('word', 0.0021571891847997904), ('wear', 0.0021536967251449823), ('finally', 0.00212226458825171), ('job', 0.0020966532174497843), ('sit', 0.0020873399917036295), ('trip', 0.0020768626127392054), ('house', 0.0020698776934295893), ('local', 0.0020477587822824717), ('mother', 0.002012834185734391), ('city', 0.0019895508885383606), ('wife', 0.001983730122447014), ('isn', 0.0019616112112998962)]

Subtopics of Level 1 topic #0
    Top 20 words in Level 2 topic #83: probability in Level 1 topic #0: 0.03294883668422699
    [('del', 0.042800258845090866), ('con', 0.03453117236495018), ('view_athlete_profile', 0.03253205120563507), ('para', 0.028715549036860466), ('de_la', 0.02362687885761261), ('en_el', 0.02244557812809944), ('en_la', 0.021355148404836655), ('gallery', 0.020173851400613785), ('los', 0.019901243969798088), ('metro', 0.019265159964561462), ('por', 0.018083861097693443), ('ward', 0.018083861097693443), ('world_athletics_series', 0.017266038805246353), ('pero', 0.01635734736919403), ('iaaf_anti_dop_news', 0.0162664782255888), ('iaaf_challenge_var', 0.0162664782255888), ('window_location_href_indexof', 0.0162664782255888), ('iaaf_iaaf_sectionselect', 0.0162664782255888), ('home_iaaf_navselect', 0.0162664782255888), ('detail_iaaf_rights_reserve', 0.0162664782255888)]
    Top 20 words in Level 2 topic #88: probability in Level 1 topic #0: 0.029698899015784264
    [('farah', 0.14431841671466827), ('british', 0.058519646525382996), ('mo_farah', 0.03246224299073219), ('lewis', 0.03084893338382244), ('britain', 0.02424902841448784), ('coe', 0.01852911151945591), ('bannister', 0.013004746288061142), ('scott', 0.01266252901405096), ('olympics', 0.01256475318223238), ('jurek', 0.011929205618798733), ('king', 0.010609224438667297), ('ccs', 0.008018150925636292), ('briton', 0.007431492675095797), ('fan', 0.007040387485176325), ('games', 0.007040387485176325), ('sjs', 0.006404840853065252), ('sds', 0.006160400342196226), ('crowd', 0.005475965794175863), ('double_olympic_gold', 0.005378189496695995), ('turner', 0.005280413199216127)]
    Top 20 words in Level 2 topic #24: probability in Level 1 topic #0: 0.026125092059373856
    [('master', 0.04368231073021889), ('kastor', 0.03583258390426636), ('club', 0.030748562887310982), ('answ', 0.021109260618686676), ('masters', 0.01854691468179226), ('daughter', 0.011917351745069027), ('john', 0.011836007237434387), ('future', 0.01069718599319458), ('coaching', 0.00951769296079874), ('grow', 0.00866357795894146), ('study', 0.008297528140246868), ('child', 0.007972151041030884), ('session', 0.007972151041030884), ('dad', 0.007728117983788252), ('attend', 0.00748408492654562), ('father', 0.007280724123120308), ('competitive', 0.007280724123120308), ('pappas', 0.007199379615485668), ('professional', 0.007077363319694996), ('coached', 0.006711313501000404)]
    Top 20 words in Level 2 topic #69: probability in Level 1 topic #0: 0.02252073958516121
    [('centrowitz', 0.048259954899549484), ('loroupe', 0.03256669268012047), ('andrews', 0.027849644422531128), ('manzano', 0.02748679369688034), ('marker_class_redactor_selection', 0.025944681838154793), ('lt_span_id_selection', 0.025944681838154793), ('marker_data_verified_redactor', 0.025853969156742096), ('gt_amp_amp_amp', 0.02340473048388958), ('wheating', 0.020592642948031425), ('amp_lt_span', 0.020411217585206032), ('centro', 0.01514989323914051), ('soccer', 0.013244930654764175), ('gibb', 0.012156380340456963), ('meyer', 0.010886406525969505), ('seiko', 0.010795693844556808), ('tegla_loroupe', 0.007439331617206335), ('finish_line', 0.0073486194014549255), ('arnold', 0.007076481822878122), ('mason', 0.0069857691414654255), ('abera', 0.0063507817685604095)]
    Top 20 words in Level 2 topic #74: probability in Level 1 topic #0: 0.018196970224380493
    [('hammer', 0.050764650106430054), ('poland', 0.031791407614946365), ('polish', 0.026649802923202515), ('hammer_throw', 0.019697776064276695), ('discus', 0.019589150324463844), ('wlodarczyk', 0.017561474815011024), ('javelin', 0.01694593019783497), ('bydgoszcz', 0.012962997891008854), ('throw_metre', 0.01267333049327135), ('pole', 0.012021577917039394), ('fajdek', 0.012021577917039394), ('national_record', 0.011731909587979317), ('thrower', 0.010899115353822708), ('hammer_thrower', 0.008907647803425789), ('pole_vault', 0.008871439844369888), ('pars', 0.008654188364744186), ('shoot', 0.008617980405688286), ('stadium', 0.008147269487380981), ('foul', 0.008074852637946606), ('stefanidi', 0.007821393199265003)]
    Top 20 words in Level 2 topic #38: probability in Level 1 topic #0: 0.017707522958517075
    [('taylor', 0.03784286230802536), ('australia', 0.019456950947642326), ('james', 0.018479472026228905), ('australian', 0.013824810273945332), ('british', 0.012940424494445324), ('amazing', 0.009077055379748344), ('represent', 0.008983961306512356), ('wheelchair', 0.008983961306512356), ('impressive', 0.008937414735555649), ('incredible', 0.008099576458334923), ('kevin', 0.008006482385098934), ('girl', 0.007773749530315399), ('zatopek', 0.007680656388401985), ('bronze_medal', 0.007494470104575157), ('successful', 0.007122097071260214), ('cross_line', 0.007075550500303507), ('wa_athlete', 0.0067497240379452705), ('stewart', 0.0063773514702916145), ('international', 0.006237711291760206), ('paralympics', 0.006191164720803499)]
    Top 20 words in Level 2 topic #47: probability in Level 1 topic #0: 0.016782430931925774
    [('williams', 0.0695289671421051), ('division', 0.034840475767850876), ('seed', 0.031104013323783875), ('spot', 0.024792425334453583), ('pistorius', 0.018884778022766113), ('rank', 0.016158172860741615), ('mcgillivray', 0.012472203932702541), ('wartburg', 0.012320725247263908), ('top', 0.012118754908442497), ('squad', 0.011815798468887806), ('diii', 0.011411856859922409), ('north_central', 0.010906930081546307), ('ranking', 0.010553481057286263), ('mit', 0.009998060762882233), ('uw_la_crosse', 0.00949313398450613), ('uw_oshkosh', 0.008887221105396748), ('haverford', 0.00848327949643135), ('calvin', 0.00818032305687666), ('division_iii', 0.008079337887465954), ('finisher', 0.007523918058723211)]
    Top 20 words in Level 2 topic #17: probability in Level 1 topic #0: 0.01632252149283886
    [('girl', 0.06424186378717422), ('boy', 0.03782522678375244), ('pre', 0.018000073730945587), ('nxn', 0.017797257751226425), ('sophomore', 0.015616999007761478), ('freshman', 0.014247998595237732), ('class', 0.013512793928384781), ('foot_locker', 0.011459293775260448), ('score', 0.01074944157153368), ('california', 0.009963533841073513), ('nation', 0.00907621905207634), ('regional', 0.00902551505714655), ('region', 0.008645237423479557), ('prep', 0.008188904263079166), ('rank', 0.007833978161215782), ('squad', 0.0077325706370174885), ('division', 0.0072762370109558105), ('favorite', 0.006718496326357126), ('teammate', 0.006693144328892231), ('fisher', 0.006464977283030748)]
    Top 20 words in Level 2 topic #54: probability in Level 1 topic #0: 0.014957565814256668
    [('italian', 0.09318435937166214), ('italy', 0.04657537862658501), ('turin', 0.014427738264203072), ('milan', 0.012356982566416264), ('rome', 0.012356982566416264), ('edition', 0.010455960407853127), ('diego_sampaolo_iaaf', 0.01011649239808321), ('test', 0.009946757927536964), ('mountain', 0.009777024388313293), ('represent', 0.008622831664979458), ('bronze_medal', 0.007842054590582848), ('podium', 0.007332852575927973), ('impressive', 0.0067897033877670765), ('france', 0.006687862798571587), ('club', 0.006382341496646404), ('international', 0.006382341496646404), ('highlight', 0.00624655419960618), ('howe', 0.006212607491761446), ('major', 0.006076820194721222), ('florence', 0.00577129889279604)]
    Top 20 words in Level 2 topic #43: probability in Level 1 topic #0: 0.01493541058152914
    [('dibaba', 0.1432160884141922), ('defar', 0.07734237611293793), ('ethiopian', 0.04647086560726166), ('ethiopia', 0.034567940980196), ('ayana', 0.02418685145676136), ('tirunesh_dibaba', 0.022393260151147842), ('meseret_defar', 0.020599668845534325), ('cheruiyot', 0.017175540328025818), ('genzebe_dibaba', 0.01549065113067627), ('robinson', 0.010544686578214169), ('sister', 0.010055525228381157), ('tirunesh', 0.009457660838961601), ('world_indoor', 0.009403309784829617), ('burka', 0.008914148434996605), ('beijing', 0.008044528774917126), ('cherono', 0.007990177720785141), ('almaz_ayana', 0.007011855021119118), ('ethiopians', 0.006577045191079378), ('kick', 0.006250937934964895), ('final_lap', 0.005381317809224129)]


Subtopics of Level 1 topic #1
    Top 20 words in Level 2 topic #83: probability in Level 1 topic #1: 0.03724795952439308
    [('del', 0.042800258845090866), ('con', 0.03453117236495018), ('view_athlete_profile', 0.03253205120563507), ('para', 0.028715549036860466), ('de_la', 0.02362687885761261), ('en_el', 0.02244557812809944), ('en_la', 0.021355148404836655), ('gallery', 0.020173851400613785), ('los', 0.019901243969798088), ('metro', 0.019265159964561462), ('por', 0.018083861097693443), ('ward', 0.018083861097693443), ('world_athletics_series', 0.017266038805246353), ('pero', 0.01635734736919403), ('iaaf_anti_dop_news', 0.0162664782255888), ('iaaf_challenge_var', 0.0162664782255888), ('window_location_href_indexof', 0.0162664782255888), ('iaaf_iaaf_sectionselect', 0.0162664782255888), ('home_iaaf_navselect', 0.0162664782255888), ('detail_iaaf_rights_reserve', 0.0162664782255888)]
    Top 20 words in Level 2 topic #88: probability in Level 1 topic #1: 0.028714383020997047
    [('farah', 0.14431841671466827), ('british', 0.058519646525382996), ('mo_farah', 0.03246224299073219), ('lewis', 0.03084893338382244), ('britain', 0.02424902841448784), ('coe', 0.01852911151945591), ('bannister', 0.013004746288061142), ('scott', 0.01266252901405096), ('olympics', 0.01256475318223238), ('jurek', 0.011929205618798733), ('king', 0.010609224438667297), ('ccs', 0.008018150925636292), ('briton', 0.007431492675095797), ('fan', 0.007040387485176325), ('games', 0.007040387485176325), ('sjs', 0.006404840853065252), ('sds', 0.006160400342196226), ('crowd', 0.005475965794175863), ('double_olympic_gold', 0.005378189496695995), ('turner', 0.005280413199216127)]
    Top 20 words in Level 2 topic #69: probability in Level 1 topic #1: 0.02350527048110962
    [('centrowitz', 0.048259954899549484), ('loroupe', 0.03256669268012047), ('andrews', 0.027849644422531128), ('manzano', 0.02748679369688034), ('marker_class_redactor_selection', 0.025944681838154793), ('lt_span_id_selection', 0.025944681838154793), ('marker_data_verified_redactor', 0.025853969156742096), ('gt_amp_amp_amp', 0.02340473048388958), ('wheating', 0.020592642948031425), ('amp_lt_span', 0.020411217585206032), ('centro', 0.01514989323914051), ('soccer', 0.013244930654764175), ('gibb', 0.012156380340456963), ('meyer', 0.010886406525969505), ('seiko', 0.010795693844556808), ('tegla_loroupe', 0.007439331617206335), ('finish_line', 0.0073486194014549255), ('arnold', 0.007076481822878122), ('mason', 0.0069857691414654255), ('abera', 0.0063507817685604095)]
    Top 20 words in Level 2 topic #38: probability in Level 1 topic #1: 0.016793061047792435
    [('taylor', 0.03784286230802536), ('australia', 0.019456950947642326), ('james', 0.018479472026228905), ('australian', 0.013824810273945332), ('british', 0.012940424494445324), ('amazing', 0.009077055379748344), ('represent', 0.008983961306512356), ('wheelchair', 0.008983961306512356), ('impressive', 0.008937414735555649), ('incredible', 0.008099576458334923), ('kevin', 0.008006482385098934), ('girl', 0.007773749530315399), ('zatopek', 0.007680656388401985), ('bronze_medal', 0.007494470104575157), ('successful', 0.007122097071260214), ('cross_line', 0.007075550500303507), ('wa_athlete', 0.0067497240379452705), ('stewart', 0.0063773514702916145), ('international', 0.006237711291760206), ('paralympics', 0.006191164720803499)]
    Top 20 words in Level 2 topic #47: probability in Level 1 topic #1: 0.016283996403217316
    [('williams', 0.0695289671421051), ('division', 0.034840475767850876), ('seed', 0.031104013323783875), ('spot', 0.024792425334453583), ('pistorius', 0.018884778022766113), ('rank', 0.016158172860741615), ('mcgillivray', 0.012472203932702541), ('wartburg', 0.012320725247263908), ('top', 0.012118754908442497), ('squad', 0.011815798468887806), ('diii', 0.011411856859922409), ('north_central', 0.010906930081546307), ('ranking', 0.010553481057286263), ('mit', 0.009998060762882233), ('uw_la_crosse', 0.00949313398450613), ('uw_oshkosh', 0.008887221105396748), ('haverford', 0.00848327949643135), ('calvin', 0.00818032305687666), ('division_iii', 0.008079337887465954), ('finisher', 0.007523918058723211)]
    Top 20 words in Level 2 topic #43: probability in Level 1 topic #1: 0.01593055948615074
    [('dibaba', 0.1432160884141922), ('defar', 0.07734237611293793), ('ethiopian', 0.04647086560726166), ('ethiopia', 0.034567940980196), ('ayana', 0.02418685145676136), ('tirunesh_dibaba', 0.022393260151147842), ('meseret_defar', 0.020599668845534325), ('cheruiyot', 0.017175540328025818), ('genzebe_dibaba', 0.01549065113067627), ('robinson', 0.010544686578214169), ('sister', 0.010055525228381157), ('tirunesh', 0.009457660838961601), ('world_indoor', 0.009403309784829617), ('burka', 0.008914148434996605), ('beijing', 0.008044528774917126), ('cherono', 0.007990177720785141), ('almaz_ayana', 0.007011855021119118), ('ethiopians', 0.006577045191079378), ('kick', 0.006250937934964895), ('final_lap', 0.005381317809224129)]
    Top 20 words in Level 2 topic #2: probability in Level 1 topic #1: 0.015894418582320213
    [('study', 0.024416295811533928), ('exercise', 0.017118750140070915), ('percent', 0.010360430926084518), ('researcher', 0.009296354837715626), ('increase', 0.008505487814545631), ('effect', 0.008045347407460213), ('research', 0.00803096778690815), ('test', 0.007506119552999735), ('measure', 0.00576621200889349), ('health', 0.005557710770517588), ('average', 0.00550019321963191), ('benefit', 0.005212604999542236), ('compare', 0.005176656413823366), ('difference', 0.005068811122328043), ('datum', 0.005040052346885204), ('report', 0.004896258469671011), ('low', 0.004860309883952141), ('suggest', 0.0048027923330664635), ('brain', 0.0046230498701334), ('question', 0.004457686562091112)]
    Top 20 words in Level 2 topic #74: probability in Level 1 topic #1: 0.015566090121865273
    [('hammer', 0.050764650106430054), ('poland', 0.031791407614946365), ('polish', 0.026649802923202515), ('hammer_throw', 0.019697776064276695), ('discus', 0.019589150324463844), ('wlodarczyk', 0.017561474815011024), ('javelin', 0.01694593019783497), ('bydgoszcz', 0.012962997891008854), ('throw_metre', 0.01267333049327135), ('pole', 0.012021577917039394), ('fajdek', 0.012021577917039394), ('national_record', 0.011731909587979317), ('thrower', 0.010899115353822708), ('hammer_thrower', 0.008907647803425789), ('pole_vault', 0.008871439844369888), ('pars', 0.008654188364744186), ('shoot', 0.008617980405688286), ('stadium', 0.008147269487380981), ('foul', 0.008074852637946606), ('stefanidi', 0.007821393199265003)]
    Top 20 words in Level 2 topic #72: probability in Level 1 topic #1: 0.014757806435227394
    [('talent', 0.01520497351884842), ('bear', 0.014689009636640549), ('father', 0.01386669185012579), ('international', 0.011915703304111958), ('moscow', 0.011835084296762943), ('silver_medal', 0.010835403576493263), ('sprinter', 0.010077581740915775), ('future', 0.009255263954401016), ('bronze_medal', 0.009206892922520638), ('beijing', 0.00919076893478632), ('world_junior', 0.008578062057495117), ('mother', 0.008578062057495117), ('brother', 0.008497442118823528), ('represent', 0.008223336189985275), ('football', 0.0073042758740484715), ('explain', 0.007255903910845518), ('championships', 0.0071269129402935505), ('potential', 0.006562577560544014), ('olympic_games', 0.006530330050736666), ('aim', 0.005691888276487589)]
    Top 20 words in Level 2 topic #54: probability in Level 1 topic #1: 0.014339113608002663
    [('italian', 0.09318435937166214), ('italy', 0.04657537862658501), ('turin', 0.014427738264203072), ('milan', 0.012356982566416264), ('rome', 0.012356982566416264), ('edition', 0.010455960407853127), ('diego_sampaolo_iaaf', 0.01011649239808321), ('test', 0.009946757927536964), ('mountain', 0.009777024388313293), ('represent', 0.008622831664979458), ('bronze_medal', 0.007842054590582848), ('podium', 0.007332852575927973), ('impressive', 0.0067897033877670765), ('france', 0.006687862798571587), ('club', 0.006382341496646404), ('international', 0.006382341496646404), ('highlight', 0.00624655419960618), ('howe', 0.006212607491761446), ('major', 0.006076820194721222), ('florence', 0.00577129889279604)]


Subtopics of Level 1 topic #2
    Top 20 words in Level 2 topic #83: probability in Level 1 topic #2: 0.03812101483345032
    [('del', 0.042800258845090866), ('con', 0.03453117236495018), ('view_athlete_profile', 0.03253205120563507), ('para', 0.028715549036860466), ('de_la', 0.02362687885761261), ('en_el', 0.02244557812809944), ('en_la', 0.021355148404836655), ('gallery', 0.020173851400613785), ('los', 0.019901243969798088), ('metro', 0.019265159964561462), ('por', 0.018083861097693443), ('ward', 0.018083861097693443), ('world_athletics_series', 0.017266038805246353), ('pero', 0.01635734736919403), ('iaaf_anti_dop_news', 0.0162664782255888), ('iaaf_challenge_var', 0.0162664782255888), ('window_location_href_indexof', 0.0162664782255888), ('iaaf_iaaf_sectionselect', 0.0162664782255888), ('home_iaaf_navselect', 0.0162664782255888), ('detail_iaaf_rights_reserve', 0.0162664782255888)]
    Top 20 words in Level 2 topic #88: probability in Level 1 topic #2: 0.032171111553907394
    [('farah', 0.14431841671466827), ('british', 0.058519646525382996), ('mo_farah', 0.03246224299073219), ('lewis', 0.03084893338382244), ('britain', 0.02424902841448784), ('coe', 0.01852911151945591), ('bannister', 0.013004746288061142), ('scott', 0.01266252901405096), ('olympics', 0.01256475318223238), ('jurek', 0.011929205618798733), ('king', 0.010609224438667297), ('ccs', 0.008018150925636292), ('briton', 0.007431492675095797), ('fan', 0.007040387485176325), ('games', 0.007040387485176325), ('sjs', 0.006404840853065252), ('sds', 0.006160400342196226), ('crowd', 0.005475965794175863), ('double_olympic_gold', 0.005378189496695995), ('turner', 0.005280413199216127)]
    Top 20 words in Level 2 topic #69: probability in Level 1 topic #2: 0.02367508038878441
    [('centrowitz', 0.048259954899549484), ('loroupe', 0.03256669268012047), ('andrews', 0.027849644422531128), ('manzano', 0.02748679369688034), ('marker_class_redactor_selection', 0.025944681838154793), ('lt_span_id_selection', 0.025944681838154793), ('marker_data_verified_redactor', 0.025853969156742096), ('gt_amp_amp_amp', 0.02340473048388958), ('wheating', 0.020592642948031425), ('amp_lt_span', 0.020411217585206032), ('centro', 0.01514989323914051), ('soccer', 0.013244930654764175), ('gibb', 0.012156380340456963), ('meyer', 0.010886406525969505), ('seiko', 0.010795693844556808), ('tegla_loroupe', 0.007439331617206335), ('finish_line', 0.0073486194014549255), ('arnold', 0.007076481822878122), ('mason', 0.0069857691414654255), ('abera', 0.0063507817685604095)]
    Top 20 words in Level 2 topic #85: probability in Level 1 topic #2: 0.017983712255954742
    [('register_weekly_newsletter', 0.026035651564598083), ('stay_informed_late_news', 0.025736212730407715), ('russia', 0.024160213768482208), ('germany', 0.020693017169833183), ('france', 0.016012301668524742), ('pole_vault', 0.013931984081864357), ('ukraine', 0.013569504022598267), ('poland', 0.01292334496974945), ('national_record', 0.011930465698242188), ('britain', 0.010259907692670822), ('sweden', 0.010165347717702389), ('european_champion', 0.009203988127410412), ('russian', 0.008967588655650616), ('spain', 0.00880998931825161), ('europe', 0.007407350465655327), ('german', 0.007170950528234243), ('triumph', 0.007123670540750027), ('greece', 0.00680847093462944), ('gold_medallist', 0.00666663097217679), ('italy', 0.00666663097217679)]
    Top 20 words in Level 2 topic #38: probability in Level 1 topic #2: 0.017919333651661873
    [('taylor', 0.03784286230802536), ('australia', 0.019456950947642326), ('james', 0.018479472026228905), ('australian', 0.013824810273945332), ('british', 0.012940424494445324), ('amazing', 0.009077055379748344), ('represent', 0.008983961306512356), ('wheelchair', 0.008983961306512356), ('impressive', 0.008937414735555649), ('incredible', 0.008099576458334923), ('kevin', 0.008006482385098934), ('girl', 0.007773749530315399), ('zatopek', 0.007680656388401985), ('bronze_medal', 0.007494470104575157), ('successful', 0.007122097071260214), ('cross_line', 0.007075550500303507), ('wa_athlete', 0.0067497240379452705), ('stewart', 0.0063773514702916145), ('international', 0.006237711291760206), ('paralympics', 0.006191164720803499)]
    Top 20 words in Level 2 topic #43: probability in Level 1 topic #2: 0.017900632694363594
    [('dibaba', 0.1432160884141922), ('defar', 0.07734237611293793), ('ethiopian', 0.04647086560726166), ('ethiopia', 0.034567940980196), ('ayana', 0.02418685145676136), ('tirunesh_dibaba', 0.022393260151147842), ('meseret_defar', 0.020599668845534325), ('cheruiyot', 0.017175540328025818), ('genzebe_dibaba', 0.01549065113067627), ('robinson', 0.010544686578214169), ('sister', 0.010055525228381157), ('tirunesh', 0.009457660838961601), ('world_indoor', 0.009403309784829617), ('burka', 0.008914148434996605), ('beijing', 0.008044528774917126), ('cherono', 0.007990177720785141), ('almaz_ayana', 0.007011855021119118), ('ethiopians', 0.006577045191079378), ('kick', 0.006250937934964895), ('final_lap', 0.005381317809224129)]
    Top 20 words in Level 2 topic #54: probability in Level 1 topic #2: 0.01782587543129921
    [('italian', 0.09318435937166214), ('italy', 0.04657537862658501), ('turin', 0.014427738264203072), ('milan', 0.012356982566416264), ('rome', 0.012356982566416264), ('edition', 0.010455960407853127), ('diego_sampaolo_iaaf', 0.01011649239808321), ('test', 0.009946757927536964), ('mountain', 0.009777024388313293), ('represent', 0.008622831664979458), ('bronze_medal', 0.007842054590582848), ('podium', 0.007332852575927973), ('impressive', 0.0067897033877670765), ('france', 0.006687862798571587), ('club', 0.006382341496646404), ('international', 0.006382341496646404), ('highlight', 0.00624655419960618), ('howe', 0.006212607491761446), ('major', 0.006076820194721222), ('florence', 0.00577129889279604)]
    Top 20 words in Level 2 topic #74: probability in Level 1 topic #2: 0.017566286027431488
    [('hammer', 0.050764650106430054), ('poland', 0.031791407614946365), ('polish', 0.026649802923202515), ('hammer_throw', 0.019697776064276695), ('discus', 0.019589150324463844), ('wlodarczyk', 0.017561474815011024), ('javelin', 0.01694593019783497), ('bydgoszcz', 0.012962997891008854), ('throw_metre', 0.01267333049327135), ('pole', 0.012021577917039394), ('fajdek', 0.012021577917039394), ('national_record', 0.011731909587979317), ('thrower', 0.010899115353822708), ('hammer_thrower', 0.008907647803425789), ('pole_vault', 0.008871439844369888), ('pars', 0.008654188364744186), ('shoot', 0.008617980405688286), ('stadium', 0.008147269487380981), ('foul', 0.008074852637946606), ('stefanidi', 0.007821393199265003)]
    Top 20 words in Level 2 topic #47: probability in Level 1 topic #2: 0.0166784655302763
    [('williams', 0.0695289671421051), ('division', 0.034840475767850876), ('seed', 0.031104013323783875), ('spot', 0.024792425334453583), ('pistorius', 0.018884778022766113), ('rank', 0.016158172860741615), ('mcgillivray', 0.012472203932702541), ('wartburg', 0.012320725247263908), ('top', 0.012118754908442497), ('squad', 0.011815798468887806), ('diii', 0.011411856859922409), ('north_central', 0.010906930081546307), ('ranking', 0.010553481057286263), ('mit', 0.009998060762882233), ('uw_la_crosse', 0.00949313398450613), ('uw_oshkosh', 0.008887221105396748), ('haverford', 0.00848327949643135), ('calvin', 0.00818032305687666), ('division_iii', 0.008079337887465954), ('finisher', 0.007523918058723211)]
    Top 20 words in Level 2 topic #2: probability in Level 1 topic #2: 0.015462524257600307
    [('study', 0.024416295811533928), ('exercise', 0.017118750140070915), ('percent', 0.010360430926084518), ('researcher', 0.009296354837715626), ('increase', 0.008505487814545631), ('effect', 0.008045347407460213), ('research', 0.00803096778690815), ('test', 0.007506119552999735), ('measure', 0.00576621200889349), ('health', 0.005557710770517588), ('average', 0.00550019321963191), ('benefit', 0.005212604999542236), ('compare', 0.005176656413823366), ('difference', 0.005068811122328043), ('datum', 0.005040052346885204), ('report', 0.004896258469671011), ('low', 0.004860309883952141), ('suggest', 0.0048027923330664635), ('brain', 0.0046230498701334), ('question', 0.004457686562091112)]


Subtopics of Level 1 topic #3
    Top 20 words in Level 2 topic #83: probability in Level 1 topic #3: 0.0429561547935009
    [('del', 0.042800258845090866), ('con', 0.03453117236495018), ('view_athlete_profile', 0.03253205120563507), ('para', 0.028715549036860466), ('de_la', 0.02362687885761261), ('en_el', 0.02244557812809944), ('en_la', 0.021355148404836655), ('gallery', 0.020173851400613785), ('los', 0.019901243969798088), ('metro', 0.019265159964561462), ('por', 0.018083861097693443), ('ward', 0.018083861097693443), ('world_athletics_series', 0.017266038805246353), ('pero', 0.01635734736919403), ('iaaf_anti_dop_news', 0.0162664782255888), ('iaaf_challenge_var', 0.0162664782255888), ('window_location_href_indexof', 0.0162664782255888), ('iaaf_iaaf_sectionselect', 0.0162664782255888), ('home_iaaf_navselect', 0.0162664782255888), ('detail_iaaf_rights_reserve', 0.0162664782255888)]
    Top 20 words in Level 2 topic #88: probability in Level 1 topic #3: 0.02922626957297325
    [('farah', 0.14431841671466827), ('british', 0.058519646525382996), ('mo_farah', 0.03246224299073219), ('lewis', 0.03084893338382244), ('britain', 0.02424902841448784), ('coe', 0.01852911151945591), ('bannister', 0.013004746288061142), ('scott', 0.01266252901405096), ('olympics', 0.01256475318223238), ('jurek', 0.011929205618798733), ('king', 0.010609224438667297), ('ccs', 0.008018150925636292), ('briton', 0.007431492675095797), ('fan', 0.007040387485176325), ('games', 0.007040387485176325), ('sjs', 0.006404840853065252), ('sds', 0.006160400342196226), ('crowd', 0.005475965794175863), ('double_olympic_gold', 0.005378189496695995), ('turner', 0.005280413199216127)]
    Top 20 words in Level 2 topic #69: probability in Level 1 topic #3: 0.020650658756494522
    [('centrowitz', 0.048259954899549484), ('loroupe', 0.03256669268012047), ('andrews', 0.027849644422531128), ('manzano', 0.02748679369688034), ('marker_class_redactor_selection', 0.025944681838154793), ('lt_span_id_selection', 0.025944681838154793), ('marker_data_verified_redactor', 0.025853969156742096), ('gt_amp_amp_amp', 0.02340473048388958), ('wheating', 0.020592642948031425), ('amp_lt_span', 0.020411217585206032), ('centro', 0.01514989323914051), ('soccer', 0.013244930654764175), ('gibb', 0.012156380340456963), ('meyer', 0.010886406525969505), ('seiko', 0.010795693844556808), ('tegla_loroupe', 0.007439331617206335), ('finish_line', 0.0073486194014549255), ('arnold', 0.007076481822878122), ('mason', 0.0069857691414654255), ('abera', 0.0063507817685604095)]
    Top 20 words in Level 2 topic #38: probability in Level 1 topic #3: 0.017427194863557816
    [('taylor', 0.03784286230802536), ('australia', 0.019456950947642326), ('james', 0.018479472026228905), ('australian', 0.013824810273945332), ('british', 0.012940424494445324), ('amazing', 0.009077055379748344), ('represent', 0.008983961306512356), ('wheelchair', 0.008983961306512356), ('impressive', 0.008937414735555649), ('incredible', 0.008099576458334923), ('kevin', 0.008006482385098934), ('girl', 0.007773749530315399), ('zatopek', 0.007680656388401985), ('bronze_medal', 0.007494470104575157), ('successful', 0.007122097071260214), ('cross_line', 0.007075550500303507), ('wa_athlete', 0.0067497240379452705), ('stewart', 0.0063773514702916145), ('international', 0.006237711291760206), ('paralympics', 0.006191164720803499)]
    Top 20 words in Level 2 topic #47: probability in Level 1 topic #3: 0.01590525358915329
    [('williams', 0.0695289671421051), ('division', 0.034840475767850876), ('seed', 0.031104013323783875), ('spot', 0.024792425334453583), ('pistorius', 0.018884778022766113), ('rank', 0.016158172860741615), ('mcgillivray', 0.012472203932702541), ('wartburg', 0.012320725247263908), ('top', 0.012118754908442497), ('squad', 0.011815798468887806), ('diii', 0.011411856859922409), ('north_central', 0.010906930081546307), ('ranking', 0.010553481057286263), ('mit', 0.009998060762882233), ('uw_la_crosse', 0.00949313398450613), ('uw_oshkosh', 0.008887221105396748), ('haverford', 0.00848327949643135), ('calvin', 0.00818032305687666), ('division_iii', 0.008079337887465954), ('finisher', 0.007523918058723211)]
    Top 20 words in Level 2 topic #74: probability in Level 1 topic #3: 0.01499514002352953
    [('hammer', 0.050764650106430054), ('poland', 0.031791407614946365), ('polish', 0.026649802923202515), ('hammer_throw', 0.019697776064276695), ('discus', 0.019589150324463844), ('wlodarczyk', 0.017561474815011024), ('javelin', 0.01694593019783497), ('bydgoszcz', 0.012962997891008854), ('throw_metre', 0.01267333049327135), ('pole', 0.012021577917039394), ('fajdek', 0.012021577917039394), ('national_record', 0.011731909587979317), ('thrower', 0.010899115353822708), ('hammer_thrower', 0.008907647803425789), ('pole_vault', 0.008871439844369888), ('pars', 0.008654188364744186), ('shoot', 0.008617980405688286), ('stadium', 0.008147269487380981), ('foul', 0.008074852637946606), ('stefanidi', 0.007821393199265003)]
    Top 20 words in Level 2 topic #17: probability in Level 1 topic #3: 0.014950492419302464
    [('girl', 0.06424186378717422), ('boy', 0.03782522678375244), ('pre', 0.018000073730945587), ('nxn', 0.017797257751226425), ('sophomore', 0.015616999007761478), ('freshman', 0.014247998595237732), ('class', 0.013512793928384781), ('foot_locker', 0.011459293775260448), ('score', 0.01074944157153368), ('california', 0.009963533841073513), ('nation', 0.00907621905207634), ('regional', 0.00902551505714655), ('region', 0.008645237423479557), ('prep', 0.008188904263079166), ('rank', 0.007833978161215782), ('squad', 0.0077325706370174885), ('division', 0.0072762370109558105), ('favorite', 0.006718496326357126), ('teammate', 0.006693144328892231), ('fisher', 0.006464977283030748)]
    Top 20 words in Level 2 topic #84: probability in Level 1 topic #3: 0.014894692227244377
    [('height', 0.04479258134961128), ('jumper', 0.034641049802303314), ('clearance', 0.020669614896178246), ('bar', 0.019125953316688538), ('vlasic', 0.01658807136118412), ('holm', 0.013893206603825092), ('swedish', 0.01305596623569727), ('sweden', 0.011826271191239357), ('equal', 0.010308774188160896), ('fail', 0.009916318580508232), ('jumping', 0.009602352976799011), ('failure', 0.00949769839644432), ('evening', 0.008869769051671028), ('outdoor', 0.008843605406582355), ('russian', 0.0086342953145504), ('bergqvist', 0.008608131669461727), ('ukhov', 0.008424985222518444), ('pole_vault', 0.007901710458099842), ('kallur', 0.007770891766995192), ('indoor_season', 0.007770891766995192)]
    Top 20 words in Level 2 topic #54: probability in Level 1 topic #3: 0.01467654388397932
    [('italian', 0.09318435937166214), ('italy', 0.04657537862658501), ('turin', 0.014427738264203072), ('milan', 0.012356982566416264), ('rome', 0.012356982566416264), ('edition', 0.010455960407853127), ('diego_sampaolo_iaaf', 0.01011649239808321), ('test', 0.009946757927536964), ('mountain', 0.009777024388313293), ('represent', 0.008622831664979458), ('bronze_medal', 0.007842054590582848), ('podium', 0.007332852575927973), ('impressive', 0.0067897033877670765), ('france', 0.006687862798571587), ('club', 0.006382341496646404), ('international', 0.006382341496646404), ('highlight', 0.00624655419960618), ('howe', 0.006212607491761446), ('major', 0.006076820194721222), ('florence', 0.00577129889279604)]
    Top 20 words in Level 2 topic #5: probability in Level 1 topic #3: 0.01461668312549591
    [('recovery', 0.011465105228126049), ('faster', 0.005792636424303055), ('increase', 0.005792636424303055), ('exercise', 0.0055434927344322205), ('hour', 0.005378880072385073), ('muscle', 0.005356634967029095), ('benefit', 0.005321043077856302), ('fitness', 0.005151981022208929), ('interval', 0.004974021576344967), ('mileage', 0.004960674326866865), ('sleep', 0.004951776470988989), ('session', 0.004933980293571949), ('intensity', 0.00484944973140955), ('relate', 0.0047782654874026775), ('specific', 0.004711530636996031), ('practice', 0.004426795057952404), ('perform', 0.0043467129580676556), ('recover', 0.004284427035599947), ('repeat', 0.003964099567383528), ('build', 0.0039329566061496735)]


Subtopics of Level 1 topic #4
    Top 20 words in Level 2 topic #83: probability in Level 1 topic #4: 0.035008203238248825
    [('del', 0.042800258845090866), ('con', 0.03453117236495018), ('view_athlete_profile', 0.03253205120563507), ('para', 0.028715549036860466), ('de_la', 0.02362687885761261), ('en_el', 0.02244557812809944), ('en_la', 0.021355148404836655), ('gallery', 0.020173851400613785), ('los', 0.019901243969798088), ('metro', 0.019265159964561462), ('por', 0.018083861097693443), ('ward', 0.018083861097693443), ('world_athletics_series', 0.017266038805246353), ('pero', 0.01635734736919403), ('iaaf_anti_dop_news', 0.0162664782255888), ('iaaf_challenge_var', 0.0162664782255888), ('window_location_href_indexof', 0.0162664782255888), ('iaaf_iaaf_sectionselect', 0.0162664782255888), ('home_iaaf_navselect', 0.0162664782255888), ('detail_iaaf_rights_reserve', 0.0162664782255888)]
    Top 20 words in Level 2 topic #88: probability in Level 1 topic #4: 0.026329925283789635
    [('farah', 0.14431841671466827), ('british', 0.058519646525382996), ('mo_farah', 0.03246224299073219), ('lewis', 0.03084893338382244), ('britain', 0.02424902841448784), ('coe', 0.01852911151945591), ('bannister', 0.013004746288061142), ('scott', 0.01266252901405096), ('olympics', 0.01256475318223238), ('jurek', 0.011929205618798733), ('king', 0.010609224438667297), ('ccs', 0.008018150925636292), ('briton', 0.007431492675095797), ('fan', 0.007040387485176325), ('games', 0.007040387485176325), ('sjs', 0.006404840853065252), ('sds', 0.006160400342196226), ('crowd', 0.005475965794175863), ('double_olympic_gold', 0.005378189496695995), ('turner', 0.005280413199216127)]
    Top 20 words in Level 2 topic #69: probability in Level 1 topic #4: 0.01829768717288971
    [('centrowitz', 0.048259954899549484), ('loroupe', 0.03256669268012047), ('andrews', 0.027849644422531128), ('manzano', 0.02748679369688034), ('marker_class_redactor_selection', 0.025944681838154793), ('lt_span_id_selection', 0.025944681838154793), ('marker_data_verified_redactor', 0.025853969156742096), ('gt_amp_amp_amp', 0.02340473048388958), ('wheating', 0.020592642948031425), ('amp_lt_span', 0.020411217585206032), ('centro', 0.01514989323914051), ('soccer', 0.013244930654764175), ('gibb', 0.012156380340456963), ('meyer', 0.010886406525969505), ('seiko', 0.010795693844556808), ('tegla_loroupe', 0.007439331617206335), ('finish_line', 0.0073486194014549255), ('arnold', 0.007076481822878122), ('mason', 0.0069857691414654255), ('abera', 0.0063507817685604095)]
    Top 20 words in Level 2 topic #24: probability in Level 1 topic #4: 0.01721481792628765
    [('master', 0.04368231073021889), ('kastor', 0.03583258390426636), ('club', 0.030748562887310982), ('answ', 0.021109260618686676), ('masters', 0.01854691468179226), ('daughter', 0.011917351745069027), ('john', 0.011836007237434387), ('future', 0.01069718599319458), ('coaching', 0.00951769296079874), ('grow', 0.00866357795894146), ('study', 0.008297528140246868), ('child', 0.007972151041030884), ('session', 0.007972151041030884), ('dad', 0.007728117983788252), ('attend', 0.00748408492654562), ('father', 0.007280724123120308), ('competitive', 0.007280724123120308), ('pappas', 0.007199379615485668), ('professional', 0.007077363319694996), ('coached', 0.006711313501000404)]
    Top 20 words in Level 2 topic #74: probability in Level 1 topic #4: 0.0167548805475235
    [('hammer', 0.050764650106430054), ('poland', 0.031791407614946365), ('polish', 0.026649802923202515), ('hammer_throw', 0.019697776064276695), ('discus', 0.019589150324463844), ('wlodarczyk', 0.017561474815011024), ('javelin', 0.01694593019783497), ('bydgoszcz', 0.012962997891008854), ('throw_metre', 0.01267333049327135), ('pole', 0.012021577917039394), ('fajdek', 0.012021577917039394), ('national_record', 0.011731909587979317), ('thrower', 0.010899115353822708), ('hammer_thrower', 0.008907647803425789), ('pole_vault', 0.008871439844369888), ('pars', 0.008654188364744186), ('shoot', 0.008617980405688286), ('stadium', 0.008147269487380981), ('foul', 0.008074852637946606), ('stefanidi', 0.007821393199265003)]
    Top 20 words in Level 2 topic #38: probability in Level 1 topic #4: 0.01598154567182064
    [('taylor', 0.03784286230802536), ('australia', 0.019456950947642326), ('james', 0.018479472026228905), ('australian', 0.013824810273945332), ('british', 0.012940424494445324), ('amazing', 0.009077055379748344), ('represent', 0.008983961306512356), ('wheelchair', 0.008983961306512356), ('impressive', 0.008937414735555649), ('incredible', 0.008099576458334923), ('kevin', 0.008006482385098934), ('girl', 0.007773749530315399), ('zatopek', 0.007680656388401985), ('bronze_medal', 0.007494470104575157), ('successful', 0.007122097071260214), ('cross_line', 0.007075550500303507), ('wa_athlete', 0.0067497240379452705), ('stewart', 0.0063773514702916145), ('international', 0.006237711291760206), ('paralympics', 0.006191164720803499)]
    Top 20 words in Level 2 topic #72: probability in Level 1 topic #4: 0.014242026023566723
    [('talent', 0.01520497351884842), ('bear', 0.014689009636640549), ('father', 0.01386669185012579), ('international', 0.011915703304111958), ('moscow', 0.011835084296762943), ('silver_medal', 0.010835403576493263), ('sprinter', 0.010077581740915775), ('future', 0.009255263954401016), ('bronze_medal', 0.009206892922520638), ('beijing', 0.00919076893478632), ('world_junior', 0.008578062057495117), ('mother', 0.008578062057495117), ('brother', 0.008497442118823528), ('represent', 0.008223336189985275), ('football', 0.0073042758740484715), ('explain', 0.007255903910845518), ('championships', 0.0071269129402935505), ('potential', 0.006562577560544014), ('olympic_games', 0.006530330050736666), ('aim', 0.005691888276487589)]
    Top 20 words in Level 2 topic #43: probability in Level 1 topic #4: 0.01378979254513979
    [('dibaba', 0.1432160884141922), ('defar', 0.07734237611293793), ('ethiopian', 0.04647086560726166), ('ethiopia', 0.034567940980196), ('ayana', 0.02418685145676136), ('tirunesh_dibaba', 0.022393260151147842), ('meseret_defar', 0.020599668845534325), ('cheruiyot', 0.017175540328025818), ('genzebe_dibaba', 0.01549065113067627), ('robinson', 0.010544686578214169), ('sister', 0.010055525228381157), ('tirunesh', 0.009457660838961601), ('world_indoor', 0.009403309784829617), ('burka', 0.008914148434996605), ('beijing', 0.008044528774917126), ('cherono', 0.007990177720785141), ('almaz_ayana', 0.007011855021119118), ('ethiopians', 0.006577045191079378), ('kick', 0.006250937934964895), ('final_lap', 0.005381317809224129)]
    Top 20 words in Level 2 topic #13: probability in Level 1 topic #4: 0.013541013933718204
    [('cain', 0.026040317490696907), ('track_field', 0.017838211730122566), ('eugene', 0.01577187143266201), ('oregon', 0.013663360849022865), ('adidas', 0.012124147266149521), ('rank', 0.01102772168815136), ('millrose_games', 0.01003672182559967), ('american_record', 0.009762615896761417), ('armory', 0.009024636819958687), ('professional', 0.008961381390690804), ('california', 0.008455338887870312), ('outdoor', 0.008434253744781017), ('hayward_field', 0.008413168601691723), ('usatf', 0.008181232959032059), ('announce', 0.008181232959032059), ('fan', 0.007970381528139114), ('york', 0.007633019704371691), ('record_holder', 0.007527594454586506), ('ncaa_champion', 0.0075065093114972115), ('excited', 0.007126977201551199)]
    Top 20 words in Level 2 topic #47: probability in Level 1 topic #4: 0.01352944690734148
    [('williams', 0.0695289671421051), ('division', 0.034840475767850876), ('seed', 0.031104013323783875), ('spot', 0.024792425334453583), ('pistorius', 0.018884778022766113), ('rank', 0.016158172860741615), ('mcgillivray', 0.012472203932702541), ('wartburg', 0.012320725247263908), ('top', 0.012118754908442497), ('squad', 0.011815798468887806), ('diii', 0.011411856859922409), ('north_central', 0.010906930081546307), ('ranking', 0.010553481057286263), ('mit', 0.009998060762882233), ('uw_la_crosse', 0.00949313398450613), ('uw_oshkosh', 0.008887221105396748), ('haverford', 0.00848327949643135), ('calvin', 0.00818032305687666), ('division_iii', 0.008079337887465954), ('finisher', 0.007523918058723211)]

Assigning Labels to Topics

Topic models do not assign labels to topics; labels are normally assigned by humans based on the topic distributions. However, tomotopy includes model implementations for automatically assigning labels to topics as a two-part process:

  1. Pointwise Mutual Information extraction, through the PMIExtractor model. This model finds words that should be considered as potential labels for topics.
  2. First-order relevance, through the FoRelevance model, based on the paper Automatic Labeling of Multinomial Topic Models. In short, this step ranks the word/phrase candidates for each topic based on their similarity to the topic distribution in the context of the corpus provided. The important part is that it accounts for the context of the corpus, which is intended to make the more highly ranked label candidates more intuitive to a human. According to the original paper, “the use of our method is not limited to labeling topic models; our method can also be used in any text management tasks where a multinomial distribution over words can be estimated, such as labeling document clusters and summarizing text”, making this model useful for a couple of tasks outside of topic modelling as well.

tomotopy provides an end-to-end example for assigning labels that can be used with any of the tomotopy models.
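
To make the first step more concrete, here is a minimal sketch of pointwise mutual information for adjacent word pairs, using only the standard library. It is not tomotopy's implementation: the function name `bigram_pmi`, the `min_count` threshold, and the toy documents are all assumptions made for illustration. The idea is that a pair scores highly when it occurs together more often than the individual word frequencies would predict, which is what makes it a plausible label candidate.

```python
import math
from collections import Counter


def bigram_pmi(docs, min_count=2):
    """Rank adjacent word pairs by PMI = log(p(a, b) / (p(a) * p(b))).

    Hypothetical helper for illustration; `docs` is a list of token lists.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in docs:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        scores[(a, b)] = math.log(p_ab / (p_a * p_b))

    # Highest PMI first: these are the strongest label candidates.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy corpus, invented for this sketch.
docs = [
    ["mo", "farah", "won", "the", "race"],
    ["the", "race", "was", "won", "by", "mo", "farah"],
    ["the", "crowd", "cheered", "the", "race"],
]
ranked = bigram_pmi(docs)
for pair, score in ranked:
    print(pair, round(score, 3))
```

In this toy corpus the pair ("mo", "farah") ranks above ("the", "race"), even though the latter occurs more often, because "the" and "race" are individually common.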

Applications

We have lightly covered several types of topic models. Each of these models is very useful for discovering topics in documents. I would highly recommend topic modelling before getting started on a machine learning text classification project, as these topic models can reveal the existence of topics that may be worth identifying but would never be spotted manually. Topic discovery does not need to be done entirely manually. Some of these models can be used to largely automate and speed up topic identification and text classification. These models can be combined to structure and refine topics:

  1. Unsupervised models: Initial topic discovery
  2. Supervised models: Word-topic relationship refinement
  3. Partially supervised models: Topic refinement and discovery
  4. Hierarchical models: Vertical relationship between topics (Is a topic a more general case of another topic? etc)

An example workflow for processing corpora of text in the context of topic modelling.

A feedback loop can be created using PLDA models and LLDA models to manually filter out incoherent topics. By repeatedly finding new topics outside of the current set of known topics, more topics are found with each iteration, until there are no more coherent topics to be identified or the user is satisfied. The words that contribute solely to incoherent topics could be considered stopwords and removed from the preprocessed documents.
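
The last step of that loop, collecting words that contribute only to incoherent topics, can be sketched in plain Python. This is an illustration under assumptions rather than part of any library: the function name `stopword_candidates` is hypothetical, the word lists are shortened from the output above, and which topics count as incoherent is a made-up reviewer judgment.

```python
def stopword_candidates(topic_words, incoherent_ids):
    """Return words that appear only in topics marked incoherent.

    Hypothetical helper: `topic_words` maps a topic id to its top words,
    and `incoherent_ids` is the set of topic ids a reviewer rejected.
    """
    coherent_vocab = set()
    incoherent_vocab = set()
    for topic_id, words in topic_words.items():
        if topic_id in incoherent_ids:
            incoherent_vocab.update(words)
        else:
            coherent_vocab.update(words)
    # A word shared with any coherent topic is kept out of the stopword list.
    return incoherent_vocab - coherent_vocab


# Shortened top-word lists taken from the output above.
topic_words = {
    83: ["del", "con", "view_athlete_profile", "para"],
    38: ["taylor", "australia", "british", "impressive"],
    88: ["farah", "british", "mo_farah", "britain"],
    54: ["italian", "italy", "impressive", "bronze_medal"],
}
# Assumed reviewer decision, for illustration only.
incoherent_ids = {83, 38}

candidates = stopword_candidates(topic_words, incoherent_ids)
print(sorted(candidates))
```

Here "british" and "impressive" are not proposed as stopwords because they also appear in topics that were kept. The result is a list of candidates only, so it is worth reviewing before removing anything from the preprocessed documents.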

The topic compositions of a given corpus of text can be analyzed using compositional data analysis to remove outliers and make inferences about the relationships between topics.
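
As one possible starting point, compositional data analysis often begins with the centred log-ratio (CLR) transform, which maps proportions that sum to one into unconstrained values that ordinary statistical tools can handle. The sketch below is an assumption about how this could be applied to a document's topic proportions, not a method the models above prescribe; the function name `clr` and the example proportions are invented for illustration.

```python
import math


def clr(proportions):
    """Centred log-ratio transform: log(x_i) minus the mean of log(x).

    Requires strictly positive proportions; zeros must be replaced
    (for example with a small pseudo-count) before calling this.
    """
    logs = [math.log(p) for p in proportions]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


# Hypothetical topic proportions for a single document.
doc_topics = [0.6, 0.3, 0.1]
transformed = clr(doc_topics)
print([round(value, 3) for value in transformed])
```

The transformed values sum to zero and preserve the ordering of the original proportions, so distance-based outlier checks or correlations between topics can be computed on them rather than on the raw proportions.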

Lastly, topic models can also be applied to data other than text. For example, topic models have been adapted to create embeddings of nodes in a graph. Happy modelling!