How to delete words from a txt file that exist in another txt file?

File a.txt has about 100k words, one word per line:

july.cpp
windows.exe
ttm.rar
document.zip

File b.txt has 150k words, one word per line. Some words also appear in a.txt, but some are new:

july.cpp
NOVEMBER.txt
windows.exe
ttm.rar
document.zip
diary.txt

How can I merge these files into one, delete all duplicate lines, and keep only the lines that are new (lines that exist in a.txt but not in b.txt, and vice versa)?

4 Answers

There is a command to do this: comm. As stated in man comm, it is simple:

comm -3 file1 file2
    Print lines in file1 not in file2, and vice versa.

Note that comm expects the contents of both files to be sorted, so you must sort them before calling comm, like this:

sort unsorted-file.txt > sorted-file.txt

So to sum up:

sort a.txt > as.txt
sort b.txt > bs.txt
comm -3 as.txt bs.txt > result.txt

After these commands, result.txt will contain the expected lines. Lines unique to bs.txt are indented with a tab, because comm prints them in its second column.
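If the sorted order or the column layout of comm is inconvenient, the same "lines in exactly one file" result can be sketched in Python with sets. This is only an illustrative sketch and not part of the answer above: the function name `lines_in_only_one` is made up, both inputs are assumed to fit in memory, and the sample lists stand in for the contents of a.txt and b.txt.

```python
def lines_in_only_one(a_lines, b_lines):
    """Return lines that occur in exactly one of the two lists.

    Lines unique to a_lines come first, then lines unique to b_lines,
    each in their original order. Repeated lines are reported once.
    """
    a_set, b_set = set(a_lines), set(b_lines)
    result = []
    seen = set()
    for line in list(a_lines) + list(b_lines):
        # True when the line is in one set but not the other.
        in_one_only = (line in a_set) != (line in b_set)
        if in_one_only and line not in seen:
            seen.add(line)
            result.append(line)
    return result


# Sample data taken from the question (stand-ins for a.txt and b.txt).
a = ["july.cpp", "windows.exe", "ttm.rar", "document.zip"]
b = ["july.cpp", "NOVEMBER.txt", "windows.exe", "ttm.rar",
     "document.zip", "diary.txt"]

print(lines_in_only_one(a, b))  # ['NOVEMBER.txt', 'diary.txt']
```

To use it on real files, read each file into a list first, for example with `[line.rstrip('\n') for line in open('a.txt')]`.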

Here is a short python3 script, based on Germar's answer, which should accomplish this while retaining b.txt's unsorted order.

#!/usr/bin/python3
with open('a.txt', 'r') as afile:
    a = set(line.rstrip('\n') for line in afile)
with open('b.txt', 'r') as bfile:
    for line in bfile:
        line = line.rstrip('\n')
        if line not in a:
            print(line)
            # Uncomment the following if you also want to remove duplicates:
            # a.add(line)

#!/usr/bin/env python3
# Load a.txt into a set so each membership test is fast.
with open('a.txt', 'r') as f:
    a = set(f.read().split('\n'))
with open('b.txt', 'r') as f:
    for b in f:
        b = b.strip('\n ')
        # Skip blank lines instead of stopping at the first one.
        if b and b not in a:
            print(b)

Have a look at the coreutils comm command - man comm

NAME
    comm - compare two sorted files line by line
SYNOPSIS
    comm [OPTION]... FILE1 FILE2
DESCRIPTION
    Compare sorted files FILE1 and FILE2 line by line.
    With no options, produce three-column output. Column one contains lines unique to FILE1, column two contains lines unique to FILE2, and column three contains lines common to both files.
    -1  suppress column 1 (lines unique to FILE1)
    -2  suppress column 2 (lines unique to FILE2)
    -3  suppress column 3 (lines that appear in both files)

So for example you can do

$ comm -13 <(sort a.txt) <(sort b.txt)
diary.txt
NOVEMBER.txt

(lines unique to b.txt)
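The three columns described in the man page excerpt can be mimicked with Python set operations, which may help when deciding which of -1, -2, and -3 to pass. This is a rough sketch rather than a reimplementation of comm: `comm_columns` is a hypothetical name, repeated lines are collapsed (comm itself treats them differently), and Python sorts by code point, so uppercase names come before lowercase ones, which can differ from the locale-aware order of sort.

```python
def comm_columns(a_lines, b_lines):
    """Split lines into the three groups that comm reports as columns."""
    a_set, b_set = set(a_lines), set(b_lines)
    only_a = sorted(a_set - b_set)   # column 1: lines unique to the first file
    only_b = sorted(b_set - a_set)   # column 2: lines unique to the second file
    common = sorted(a_set & b_set)   # column 3: lines present in both files
    return only_a, only_b, common


# Sample data taken from the question (stand-ins for a.txt and b.txt).
a = ["july.cpp", "windows.exe", "ttm.rar", "document.zip"]
b = ["july.cpp", "NOVEMBER.txt", "windows.exe", "ttm.rar",
     "document.zip", "diary.txt"]

only_a, only_b, common = comm_columns(a, b)
# Suppressing columns 1 and 3, as comm -13 does, leaves only column 2.
for line in only_b:
    print(line)
```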
