SQLAlchemy, Multithreading, and Async Crawlers

· 2018-09-08 · # SQLAlchemy # Crawler

Problems with multithreading

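At first I wanted to use multithreading to speed up the crawler, but found that a SQLAlchemy session is not thread-safe. There is, however, scoped_session, which gives each thread its own session.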

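from sqlalchemy.orm import scoped_session, sessionmaker

# engine is assumed to have been created earlier with create_engine(...)
session_factory = sessionmaker(bind=engine)
Session = scoped_session(session_factory)

# Calling Session() returns the session that belongs to the current thread
some_session = Session()

some_session.add(...)
some_session.commit()

# Discard the current thread's session once its work is done
Session.remove()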
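
The per-thread behaviour can be pictured as a thread-local registry. What follows is a conceptual sketch only, not SQLAlchemy's actual code: `ThreadLocalRegistry` and `collect_per_thread_objects` are hypothetical names, and a plain `object()` stands in for a real session.

```python
import threading


class ThreadLocalRegistry:
    """Hypothetical stand-in for the per-thread registry idea behind scoped_session."""

    def __init__(self, factory):
        self._factory = factory
        self._local = threading.local()

    def __call__(self):
        # Build the object on first use in this thread, then keep returning it.
        if not hasattr(self._local, "value"):
            self._local.value = self._factory()
        return self._local.value

    def remove(self):
        # Forget this thread's object so the next call builds a fresh one.
        if hasattr(self._local, "value"):
            del self._local.value


def collect_per_thread_objects(n_threads=3):
    """Return one (first, second) pair per worker thread."""
    registry = ThreadLocalRegistry(object)  # object() stands in for a session
    collected = []
    lock = threading.Lock()

    def worker():
        first = registry()
        second = registry()  # same thread, so the same object comes back
        with lock:
            collected.append((first, second))
        registry.remove()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return collected
```

Within one thread both calls return the same object, while different threads each get their own. That is why calling the registry anywhere in a worker is safe without passing a session around.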

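Then I ran into a connection limit: QueuePool limit of size 5 overflow 10 reached, connection timed out, timeout 30
By default the connection pool keeps 5 connections (pool_size) and allows 10 more as overflow (max_overflow), so at most 15 in total. These limits have to be set when creating the engine.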
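
A minimal sketch of raising the pool limits when the engine is created. The numbers are illustrative rather than recommendations, and the in-memory SQLite URL is an assumption standing in for the real database URL.

```python
from sqlalchemy import create_engine
from sqlalchemy.pool import QueuePool

# Assumption: "sqlite://" is a placeholder; use the real database URL instead.
engine = create_engine(
    "sqlite://",
    poolclass=QueuePool,  # explicit, since SQLite does not always default to QueuePool
    pool_size=20,         # connections kept open in the pool (default 5)
    max_overflow=10,      # extra connections allowed beyond pool_size (default 10)
    pool_timeout=30,      # seconds to wait for a free connection (default 30)
)
```

The database server has its own connection limit, so pool_size plus max_overflow should stay below it.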

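This approach was still awkward to use, though. Later I realized that running the requests asynchronously with coroutines is much simpler.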

Coroutines

Change the data-fetching code to use the aiohttp library, and add await at every call site.


import json

import aiohttp


async def get_json_data(url):
    # Fetch the URL and parse the response body as JSON
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            result = await response.text()
            return json.loads(result)

The per-article function becomes a coroutine, and main() runs the tasks through the event loop.


async def start_article(story):
    ...
    article_data = await get_json_data(api_url)
    ...
    
    
def main():
    ...
    tasks = [start_article(story) for story in stories]
    # gather accepts coroutines directly; asyncio.wait rejects bare coroutines on Python 3.11+
    loop.run_until_complete(asyncio.gather(*tasks))
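
To see why running the coroutines together saves time, here is a self-contained sketch that needs no network access. `fake_fetch`, `crawl`, and `run_crawl` are hypothetical names, `asyncio.sleep` stands in for the aiohttp request, and the semaphore is an optional extra (not part of the original crawler) that caps how many requests run at once.

```python
import asyncio
import time


async def fake_fetch(story, delay=0.1):
    # Stand-in for an aiohttp request; asyncio.sleep simulates network latency.
    await asyncio.sleep(delay)
    return {"story": story}


async def crawl(stories, limit=5):
    # The semaphore caps how many fetches are in flight at the same time.
    semaphore = asyncio.Semaphore(limit)

    async def bounded(story):
        async with semaphore:
            return await fake_fetch(story)

    # gather runs the coroutines concurrently and returns results in input order.
    return await asyncio.gather(*(bounded(story) for story in stories))


def run_crawl(stories):
    # Run the whole crawl and report how long it took in seconds.
    start = time.perf_counter()
    results = asyncio.run(crawl(stories))
    return results, time.perf_counter() - start
```

With ten stories, a 0.1 second delay, and a limit of five, the fetches run in two batches, so the elapsed time should be far below the roughly one second that ten sequential fetches would take.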

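In a quick test, fetching the articles used to take about 17 seconds and now takes about 7. That is a sizeable improvement, considering that the total still includes the time spent on database requests.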

References

https://stackoverflow.com/questions/6297404/multi-threaded-use-of-sqlalchemy
https://stackoverflow.com/questions/3360951/sql-alchemy-connection-time-out/28040482
https://juejin.im/post/5b430456e51d45198a2ea433