Python
Dash, 2016-04-29 11:18:04

How do I apply a dictionary restriction when exporting texts?

The gist:
1) I have a large number of texts from VK
2) as well as a dictionary of words that I work with
I need to extract from this text database only those texts that relate to the selected words (say, "apartment" and "house").
I more or less managed that...
BUT...
I need the exported texts to contain no other words from my dictionary!
That is, if a text contains "apartment", "chair", "wardrobe", then it should not be exported.
In other words, the result should be a set of texts in which only the selected word from the dictionary occurs and none of the others do.
Here is the code itself:

import csv
from collections import Counter

# Target words: "квартира" (apartment) and "дом" (house)
house_list = set(["квартира", "дом"])
in_csv = open("C:\\Hun\\texts_for_topicminer\\Vk_csv_full_lem_CORRECTED.csv", "rt", newline="")
out_csv = open("C:\\Hun\\dasha\\house_counter.csv", "wt", newline="")
full_house = open("C:\\Hun\\dasha\\house_list-2.csv", "rt", newline="")
reader = csv.reader(in_csv, delimiter=";")
writer = csv.writer(out_csv)
full_house_reader = csv.reader(full_house, delimiter=";")

# Read the full dictionary (first column of each row)
full_house_list = set()
for row in full_house_reader:
  full_house_list.add(row[0])
print(full_house_list)
# Remove the target words, leaving only the words that must not occur
for house in house_list:
  full_house_list.remove(house)

writer.writerow(["line_number", "auth_id", "date", "text", "city", "region", "text_length", "квартира", "дом"])
for num, row in enumerate(reader):
  words_list = row[0].split()
  # Intended to skip texts that contain other dictionary words; as written,
  # issubset() is true only when ALL of the remaining words occur in the text
  if set(full_house_list).issubset(words_list):
    continue
  else:
    cnt = Counter(words_list)
    # two_house becomes True if at least one target word occurs in the text
    two_house = False
    for house in house_list:
      if cnt[house] != 0:
        two_house = True
    if two_house:
      house_counter = {}
      for house in house_list:
        house_counter[house] = cnt[house]
      writer.writerow([num + 1, row[1], row[4], row[0], row[7], row[8], len(words_list), house_counter["квартира"], house_counter["дом"]])

# Close the files so that the output is flushed to disk
for f in (in_csv, out_csv, full_house):
  f.close()

How can I do that? How do I write it in code?


1 answer(s)
Anatoly Scherbakov, 2016-04-29
@Altaisoft

The code is not idiomatic, and it's not clear what this part does. Is full_house_list a list of words or of texts?

for house in house_list:
  full_house_list.remove(house)

1. I would suggest first splitting each input text into words, then intersecting that set of words with the set of allowed words and with the set of forbidden words. A text qualifies if the first intersection is non-empty and the second is empty. Both sets must then include all word forms.
2. More generally, for a more robust solution you will need a stemming engine.
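
The two points above could be sketched roughly as follows. This is only an illustration under some assumptions: the function names `normalize` and `count_targets` are made up for this example, the English words stand in for the real dictionary, and `normalize` is a trivial placeholder (lowercasing and stripping punctuation), not a real stemmer. A text is assumed to qualify when it contains at least one target word and no other dictionary word.

```python
from collections import Counter


def normalize(word):
    # Hypothetical placeholder: lowercase and strip surrounding punctuation.
    # A real stemmer or lemmatizer could be plugged in here instead.
    return word.lower().strip('.,!?;:"()')


def count_targets(text, target_words, dictionary):
    """Return {target word: count} if the text qualifies, otherwise None."""
    target_words = set(target_words)
    # Forbidden words are all dictionary words that are not targets.
    forbidden = set(dictionary) - target_words
    tokens = [normalize(w) for w in text.split()]
    words = set(tokens)
    if words & forbidden:
        return None  # the text contains another dictionary word
    if not words & target_words:
        return None  # the text contains none of the target words
    counts = Counter(tokens)
    return {w: counts[w] for w in target_words}


dictionary = {"apartment", "house", "chair", "wardrobe"}
targets = {"apartment", "house"}

print(count_targets("Apartment and house, apartment", targets, dictionary))
print(count_targets("apartment chair wardrobe", targets, dictionary))
```

The first call returns counts for both target words, the second returns None because "chair" and "wardrobe" are in the dictionary but are not targets. In the CSV loop from the question, the same check would replace the issubset() test, with row[0] passed as the text.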
