How to read a large file

How to read a large file - line by line?

I want to iterate over each line of an entire file. One way to do this is to read the entire file, save it to a list, and then go over the line of interest. This method uses a lot of memory, so I am looking for an alternative.

My code so far:

for each_line in fileinput.input(input_file):
    do_something(each_line)

    for each_line_again in fileinput.input(input_file):
        do_something(each_line_again)

Executing this code gives an error message: device active.

Any suggestions?

The purpose is to calculate pairwise string similarity, meaning that for each line in the file I want to calculate the Levenshtein distance with every other line.

Nov. 2022 Edit: A related question that was asked 8 months after this question has many useful answers and comments. To get a deeper understanding of Python logic, also read this related question: How should I read a file line-by-line in Python?

Answer 1

The correct, fully Pythonic way to read a file is the following:

with open(...) as f:
    for line in f:
        # Do something with 'line'

The with statement handles opening and closing the file, including when an exception is raised in the inner block. The for line in f treats the file object f as an iterable, which automatically uses buffered I/O and memory management, so you don't have to worry about large files.

There should be one -- and preferably only one -- obvious way to do it.
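
For the pairwise comparison described in the question, one option is to open the file a second time inside the outer loop, so that each loop has its own independent iterator. The following is a minimal sketch under stated assumptions, not a definitive implementation: levenshtein and pairwise_distances are hypothetical helper names introduced here, and the file is assumed to be UTF-8 text.

```python
def levenshtein(a, b):
    """Plain dynamic-programming edit distance (illustrative helper, not from the original answer)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def pairwise_distances(path):
    """Yield (i, j, distance) for every pair of lines with i < j."""
    with open(path, encoding="utf-8") as outer:
        for i, line_a in enumerate(outer):
            # A second, independent handle: iterating it does not disturb 'outer'.
            with open(path, encoding="utf-8") as inner:
                for j, line_b in enumerate(inner):
                    if j > i:
                        yield i, j, levenshtein(line_a.rstrip("\n"),
                                                line_b.rstrip("\n"))
```

Only a couple of lines are held in memory at any time, but the file is re-read once per line, so the running time grows quadratically with the number of lines.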

Answer 2

Two memory-efficient ways, in ranked order (the first is best):

use of with - supported from Python 2.5 and above

use of yield - if you really want control over how much to read

1. use of `with`

with is the nice and efficient Pythonic way to read large files. Advantages: 1) the file object is automatically closed after leaving the with block. 2) the file is closed even if an exception is raised inside the with block. 3) memory: the for loop iterates through the file object f line by line; internally it does buffered I/O (to optimize costly I/O operations) and memory management.

with open("x.txt") as f:
    for line in f:
        do_something(line)

2. use of `yield`

Sometimes one might want more fine-grained control over how much to read in each iteration. In that case use iter & yield. Note that with this method one explicitly needs to close the file at the end.

def readInChunks(fileObj, chunkSize=2048):
    """
    Lazy function to read a file piece by piece.
    Default chunk size: 2kB.

    """
    while True:
        data = fileObj.read(chunkSize)
        if not data:
            break
        yield data

f = open('bigFile')
for chunk in readInChunks(f):
    do_something(chunk)
f.close()
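
The text above mentions iter alongside yield. A sketch of that variant follows, assuming the file is read in binary mode; read_in_chunks is a hypothetical name introduced here. Because the file is opened in a with block inside the generator, it is closed once the generator is exhausted or closed, so no explicit close() call is needed.

```python
from functools import partial


def read_in_chunks(path, chunk_size=2048):
    """Yield successive chunks of at most chunk_size bytes from the file at path."""
    with open(path, "rb") as file_obj:
        # iter(callable, sentinel) keeps calling file_obj.read(chunk_size)
        # until it returns the sentinel b"", which signals end of file.
        for chunk in iter(partial(file_obj.read, chunk_size), b""):
            yield chunk
```

A caller would loop over it in the same way as before, for example for chunk in read_in_chunks('bigFile'): do_something(chunk).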

Pitfalls, for the sake of completeness: the methods below are not as good or as elegant for reading large files, but they are worth reading for a rounded understanding.

In Python, the most common way to read lines from a file is to do the following:

for line in open('myfile','r').readlines():
    do_something(line)

When this is done, however, the readlines() function (the same applies to read()) loads the entire file into memory, then iterates over it. A slightly better approach for large files (though the two methods above remain the best) is to use the fileinput module, as follows:

import fileinput

for line in fileinput.input(['myfile']):
    do_something(line)

The fileinput.input() call reads lines sequentially but doesn't keep them in memory after they've been read. Or, even more simply, iterate over the file object directly, since a file in Python is iterable.
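
One caveat is that the module-level fileinput.input() function keeps a single global state, so a second call while the first is still being iterated raises a RuntimeError; this is probably what the nested loops in the question run into. A possible workaround, sketched here with the hypothetical helper name count_line_pairs, is to create a separate fileinput.FileInput object for each loop.

```python
import fileinput


def count_line_pairs(path):
    """Count ordered pairs of lines using two independent FileInput objects."""
    pairs = 0
    with fileinput.FileInput(files=[path]) as outer:
        for _ in outer:
            # A separate instance per loop: unlike fileinput.input(), these
            # objects do not share the module's global state, so nesting works.
            with fileinput.FileInput(files=[path]) as inner:
                for _ in inner:
                    pairs += 1
    return pairs
```

For a file with n lines this visits n * n pairs; the counting stands in for whatever per-pair work is actually needed.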

Answer 3

To strip newlines:

with open(file_path, 'rU') as f:
    for line_terminated in f:
        line = line_terminated.rstrip('\n')
        ...

With universal newline support all text file lines will seem to be terminated with '\n', whatever the terminators in the file, '\r', '\n', or '\r\n'.

EDIT - To specify universal newline support:

Python 2 on Unix - open(file_path, mode='rU') - required (thanks @Dave)

Python 2 on Windows - open(file_path, mode='rU') - optional

Python 3 - open(file_path, newline=None) - optional

The newline parameter is only supported in Python 3 and defaults to None, which indicates universal newline mode (the input file can have any newline; the output string gets \n). The mode parameter defaults to 'r' in all cases. The U flag is deprecated in Python 3 and was removed in Python 3.11, where open() no longer accepts it. In Python 2 on Windows some other mechanism appears to translate \r\n to \n.
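
To make the Python 3 behaviour concrete, here is a small sketch comparing newline=None (terminators translated to \n) with newline='' (universal newline splitting, but terminators left untranslated). read_lines is a hypothetical helper name, and the file is assumed to be UTF-8 text.

```python
def read_lines(path, newline=None):
    """Lazily yield the lines of a text file using the given newline mode (Python 3)."""
    with open(path, "r", newline=newline, encoding="utf-8") as f:
        for line in f:
            yield line
```

For a file containing the bytes a\r\nb\rc\n, iterating with the default newline=None gives three lines that each end in \n, while newline='' gives the same three lines with their original \r\n, \r and \n endings.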

Docs:

open() for Python 2

open() for Python 3

To preserve native line terminators:

with open(file_path, 'rb') as f:
    for line_native_terminated in f:
        ...

Binary mode can still split the file into lines when iterated with in. Each line keeps whatever terminator it has in the file.
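
As an illustration of what binary-mode iteration returns, the sketch below tallies the terminator of each line. count_terminators is a hypothetical name; note that in binary mode the lines are bytes objects and are split only after b"\n", so a lone \r is not treated as a line break.

```python
from collections import Counter


def count_terminators(path):
    """Tally line endings seen when iterating a file in binary mode."""
    counts = Counter()
    with open(path, "rb") as f:
        for raw_line in f:  # each item is bytes, terminator included
            if raw_line.endswith(b"\r\n"):
                counts["CRLF"] += 1
            elif raw_line.endswith(b"\n"):
                counts["LF"] += 1
            else:
                counts["none"] += 1  # final line without a trailing newline
    return counts
```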

Thanks to @katrielalex's answer, Python's open() doc, and iPython experiments.

Answer 4

This is one possible way of reading a file in Python:

f = open(input_file)
for line in f:
    do_stuff(line)
f.close()

It does not allocate a full list; it iterates over the lines.
